### Q1: What is data encoding? How is it useful in data science?

A1. **Data Encoding** is the process of converting categorical data (labels such as colors, cities, or product types) into a numerical format. It is essential in data science because most machine learning algorithms operate on numerical input and cannot use raw category labels directly.

### Q2: What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

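A2. **Nominal Encoding** (also known as label encoding) assigns a unique integer to each category in a categorical feature. It is used when there is no ordinal relationship between the categories, so the integers act only as arbitrary identifiers. Some models, such as linear models, may still read the codes as a ranking. A real-world example is encoding product colors in a retail dataset, as in the code below.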


In [1]:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Apply label encoding; classes are sorted alphabetically before integers are assigned
encoder = LabelEncoder()
data['Color_Encoded'] = encoder.fit_transform(data['Color'])

print(data)


   Color  Color_Encoded
0    Red              2
1   Blue              0
2  Green              1
3   Blue              0
4    Red              2
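
scikit-learn documents `LabelEncoder` as a tool for target labels. For input features, `OrdinalEncoder` produces the same kind of integer mapping. The following is a minimal sketch of that alternative, reusing the same sample data; it is one possible variant, not the only way to do this.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same sample data as above
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# OrdinalEncoder expects 2D input, hence the double brackets
encoder = OrdinalEncoder()
codes = encoder.fit_transform(data[['Color']])

# The result is a float array of shape (n_rows, 1); flatten it and cast to int
data['Color_Encoded'] = codes.ravel().astype(int)

# Learned categories, in the order used for the integer codes
print(encoder.categories_)
print(data)
```

As with `LabelEncoder`, the categories are sorted before codes are assigned, so the resulting column matches the output shown above.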




### Q3: In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

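A3. Nominal encoding is preferred over one-hot encoding when the categorical feature has a large number of categories (high cardinality). One-hot encoding would create one column per category, leading to a wide, sparse matrix and higher memory and computational cost. The trade-off is that integer codes impose an arbitrary order, so this choice works best with models that are less sensitive to it, such as tree-based models.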

**Example:**

For a feature representing zip codes, using one-hot encoding would create thousands of columns, while nominal encoding would only create one column with unique integers for each zip code.
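
The difference in width can be checked directly. The sketch below uses a hypothetical `ZipCode` column with 1,000 made-up distinct values (not real zip codes) and compares the shapes produced by the two encoders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: 1,000 distinct zip-code-like strings, one per row
zip_codes = pd.DataFrame({'ZipCode': [str(10000 + i) for i in range(1000)]})

# One-hot encoding: one column per distinct value (sparse matrix by default)
one_hot = OneHotEncoder().fit_transform(zip_codes[['ZipCode']])

# Nominal (label) encoding: a single array of integer codes
integer_codes = LabelEncoder().fit_transform(zip_codes['ZipCode'])

print(one_hot.shape)        # (1000, 1000)
print(integer_codes.shape)  # (1000,)
```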

### Q4: Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

A4. **One-Hot Encoding** would be the preferred choice. With only 5 unique values it adds just 5 binary columns, and it avoids the artificial ordinal relationship that integer codes from nominal encoding can introduce, so the model does not assume any inherent order among the categories.


In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B']})

# Applying One-Hot Encoding
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data[['Category']])

encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Category']))
print(encoded_df)

   Category_A  Category_B  Category_C
0         1.0         0.0         0.0
1         0.0         1.0         0.0
2         0.0         0.0         1.0
3         1.0         0.0         0.0
4         0.0         1.0         0.0
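
The sample above has three distinct values. To match the five unique values in the question, here is a short sketch using `pandas.get_dummies` as an alternative to `OneHotEncoder`. The category labels `A` to `E` are made up for illustration.

```python
import pandas as pd

# Hypothetical feature with five unique values, as in the question
data = pd.DataFrame({'Category': ['A', 'B', 'C', 'D', 'E', 'A', 'C']})

# get_dummies creates one indicator column per unique value
encoded = pd.get_dummies(data, columns=['Category'], dtype=int)

print(encoded.shape)          # (7, 5)
print(list(encoded.columns))
```

Each row contains exactly one `1`, so five categories cost five columns, which is manageable for most models.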


### Q5: In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

A5. Nominal encoding replaces each categorical column with exactly one column of integer codes, regardless of how many categories it contains.

- Original columns: 3 numerical + 2 categorical = 5 columns
- After nominal encoding: 3 numerical + 2 nominally encoded = 5 columns

No new columns are created; the 2 categorical columns are transformed into 2 numerical columns.
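
The calculation can be confirmed with a small sketch. The column names and the randomly generated values below are hypothetical; only the shape (1000 rows, 3 numerical and 2 categorical columns) comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n_rows = 1000

# Hypothetical dataset: 3 numerical and 2 categorical columns
df = pd.DataFrame({
    'num_1': rng.normal(size=n_rows),
    'num_2': rng.normal(size=n_rows),
    'num_3': rng.normal(size=n_rows),
    'cat_1': rng.choice(['a', 'b', 'c'], size=n_rows),
    'cat_2': rng.choice(['x', 'y'], size=n_rows),
})

columns_before = df.shape[1]

# Overwrite each categorical column with its integer codes
for col in ['cat_1', 'cat_2']:
    df[col] = LabelEncoder().fit_transform(df[col])

columns_after = df.shape[1]
print(columns_before, columns_after, columns_after - columns_before)  # 5 5 0
```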

### Q6: You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

A6. **One-Hot Encoding** would be suitable for encoding species, habitat, and diet because none of these features has an ordinal relationship among its categories. One-hot encoding gives each category its own binary column, so the machine learning algorithm treats the categories independently and does not infer a ranking between, say, `Land` and `Air`.


In [5]:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    'Species': ['Dog', 'Cat', 'Bird'],
    'Habitat': ['Land', 'Land', 'Air'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore']
})

# Applying One-Hot Encoding
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data[['Species', 'Habitat', 'Diet']])

encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Species', 'Habitat', 'Diet']))
print(encoded_df)

   Species_Bird  Species_Cat  Species_Dog  Habitat_Air  Habitat_Land  \
0           0.0          0.0          1.0          0.0           1.0   
1           0.0          1.0          0.0          0.0           1.0   
2           1.0          0.0          0.0          1.0           0.0   

   Diet_Carnivore  Diet_Herbivore  
0             1.0             0.0  
1             1.0             0.0  
2             0.0             1.0  


### Q7: You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

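A7.
1. **Identify categorical features**: Gender and Contract Type. Age, Monthly Charges, and Tenure are already numerical and need no encoding.
2. **Apply Label Encoding** to gender: with only two categories, a single 0/1 column captures all the information.
3. **Apply One-Hot Encoding** to contract type, since it may have more than two categories and one-hot encoding avoids imposing an arbitrary integer order on them.
4. **Combine** the encoded columns with the numerical features and drop the original categorical columns.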

In [6]:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample data
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [34, 22, 45, 37],
    'ContractType': ['Month-to-Month', 'One year', 'Two year', 'Month-to-Month'],
    'MonthlyCharges': [56.95, 53.85, 42.30, 70.50],
    'Tenure': [1, 34, 56, 12]
})

# Apply Label Encoding to Gender
label_encoder = LabelEncoder()
data['Gender_Encoded'] = label_encoder.fit_transform(data['Gender'])

# Apply One-Hot Encoding to ContractType
one_hot_encoder = OneHotEncoder(sparse_output=False)
contract_type_encoded = one_hot_encoder.fit_transform(data[['ContractType']])
contract_type_df = pd.DataFrame(contract_type_encoded, columns=one_hot_encoder.get_feature_names_out(['ContractType']))

# Combine all the encoded features
data = pd.concat([data, contract_type_df], axis=1)

# Drop the original categorical columns
data.drop(['Gender', 'ContractType'], axis=1, inplace=True)

print(data)

   Age  MonthlyCharges  Tenure  Gender_Encoded  ContractType_Month-to-Month  \
0   34           56.95       1               1                          1.0   
1   22           53.85      34               0                          0.0   
2   45           42.30      56               0                          0.0   
3   37           70.50      12               1                          1.0   

   ContractType_One year  ContractType_Two year  
0                    0.0                    0.0  
1                    1.0                    0.0  
2                    0.0                    1.0  
3                    0.0                    0.0  
