# Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data from one representation into another that a program can process. In data science it most often means turning categorical values, such as text labels, into numbers, because most machine learning algorithms accept only numerical input. A compact encoding can also reduce memory usage.
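As a rough illustration of the memory point, the sketch below stores the same labels once as plain Python strings and once as a pandas `category` column. The data and variable names are hypothetical, and the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 30,000 rows drawn from only three distinct labels
labels = pd.Series(["Electronics", "Clothing", "Books"] * 10_000)

# Object dtype keeps a reference to a Python string for every row
as_text = labels.astype(object)
# Category dtype keeps small integer codes plus a 3-entry lookup table
as_category = labels.astype("category")

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object dtype bytes:  ", text_bytes)
print("category dtype bytes:", category_bytes)
```

With this few distinct labels the categorical version should come out much smaller, since each label string is stored only once.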

# Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding is a type of data encoding used to convert categorical data into numerical values without assigning any specific order or ranking to the categories.

In a real-world scenario, let's say you're working on a marketing analysis for an e-commerce company. One of the features in your dataset is "Product Category," which includes categories like "Electronics," "Clothing," and "Books." Since these categories don't have a specific order, you would use nominal encoding to convert them into numerical values for machine learning algorithms. This allows the algorithms to process and analyze the data effectively while preserving the categorical nature of the feature.
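A minimal sketch of this scenario, assuming a hypothetical `product_category` column, could use `pandas.factorize`, which assigns integer codes in order of first appearance. The codes are arbitrary labels, not ranks.

```python
import pandas as pd

# Hypothetical sample of the "Product Category" feature
products = pd.DataFrame({
    "product_category": ["Electronics", "Clothing", "Books", "Clothing", "Electronics"]
})

# factorize returns the integer codes and the array of unique categories
codes, categories = pd.factorize(products["product_category"])
products["category_code"] = codes

print(list(categories))  # position in this list is the code of each category
print(products)
```

Keeping `categories` around makes it possible to map the codes back to the original labels later.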

# Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Nominal encoding, in the sense of assigning one integer code per category, is preferred over one-hot encoding mainly when the categorical feature has high cardinality (many unique categories). One-hot encoding adds one column per category, so the dimensionality of the dataset grows with the number of categories, which can become impractical. The trade-off is that linear and distance-based models may read integer codes as an ordering, so this choice is safest with tree-based models.

Practical Example:
Let's consider a scenario where you're analyzing customer reviews for a product. The dataset includes a "Keywords" column indicating the keywords mentioned in the reviews. These keywords can be quite diverse and numerous, including terms like "Quality," "Price," "Shipping," "Customer Service," and more.

Using one-hot encoding for this "Keywords" column would create a binary column for each keyword, leading to a large number of new features. This could result in a sparse matrix, increased computation time, and a higher risk of overfitting due to the high dimensionality.

Instead, using nominal encoding for the "Keywords" column assigns a unique numerical value to each keyword and keeps the feature in a single column, avoiding the one-column-per-keyword growth of one-hot encoding. This approach helps maintain a manageable dataset size and allows the machine learning algorithm to learn patterns from the keywords without creating an excessively complex model.
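The difference in width can be sketched on a small, hypothetical `keyword` column (the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical keywords column with 6 distinct values over 8 reviews
reviews = pd.DataFrame({
    "keyword": ["Quality", "Price", "Shipping", "Customer Service",
                "Packaging", "Returns", "Price", "Quality"]
})

# One-hot: one indicator column per distinct keyword
one_hot = pd.get_dummies(reviews["keyword"])

# Integer codes: a single column, however many keywords there are
integer_codes = reviews["keyword"].astype("category").cat.codes

print(one_hot.shape)        # (8, 6)
print(integer_codes.shape)  # (8,)
```

With thousands of distinct keywords the one-hot frame would have thousands of columns, while the coded version would still be one.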

# Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

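Given a categorical feature with only 5 unique values, one-hot encoding is a suitable choice. It adds at most five binary columns (four if one level is dropped), so the dimensionality stays small, and it does not impose an artificial order on the categories.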
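A short sketch with `pandas.get_dummies`, assuming a hypothetical `color` feature that has exactly five unique values:

```python
import pandas as pd

# Hypothetical feature with exactly 5 unique categories
df_colors = pd.DataFrame({"color": ["red", "green", "blue", "yellow", "black", "red"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(df_colors, columns=["color"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Passing `drop_first=True` would drop one of the five indicator columns, which is a common choice for linear models to avoid redundant features.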

# Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

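For nominal encoding, each unique category within a categorical column is assigned a unique numerical value. The codes replace the original values in place, so each categorical column remains a single column:

Number of new columns = 0
Total columns after encoding = 3 numerical + 2 encoded = 5

If "nominal encoding" is instead taken to mean one-hot encoding, as some courses use the term, the count depends on the cardinalities. Assuming the first categorical column has n unique categories and the second categorical column has m unique categories:

Number of new columns = Number of unique categories in the first column + Number of unique categories in the second column
= n + m

Total columns after encoding = 3 + n + m, since the two original categorical columns are replaced by their indicator columns.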
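To make the counting concrete, here is a sketch with assumed cardinalities n = 3 and m = 4 (hypothetical; the question does not give them). It compares replacing each categorical column with integer codes against expanding each one into an indicator column per category.

```python
import pandas as pd

# Hypothetical 1000-row dataset: 3 numerical and 2 categorical columns.
# The cardinalities n = 3 and m = 4 are assumptions for illustration.
n_rows = 1000
cat_a_values = ["a1", "a2", "a3"]        # n = 3
cat_b_values = ["b1", "b2", "b3", "b4"]  # m = 4
df5 = pd.DataFrame({
    "num_1": range(n_rows),
    "num_2": range(n_rows),
    "num_3": range(n_rows),
    "cat_a": [cat_a_values[i % 3] for i in range(n_rows)],
    "cat_b": [cat_b_values[i % 4] for i in range(n_rows)],
})

# Integer codes: each categorical column is replaced in place
coded = df5.copy()
for col in ["cat_a", "cat_b"]:
    coded[col] = coded[col].astype("category").cat.codes

# One-hot: each categorical column becomes one column per category
one_hot = pd.get_dummies(df5, columns=["cat_a", "cat_b"])

print(coded.shape)    # (1000, 5): no new columns
print(one_hot.shape)  # (1000, 10): 3 numerical + n + m
```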

# Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

I would use nominal encoding, as the number of unique values could be high, especially for species. One-hot encoding is less suitable for high-cardinality features because it would create a wide, sparse matrix.

# Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

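I would use one-hot encoding, as the categorical features (gender and contract type) have few unique values. In the sample data below both features are binary, so the `LabelEncoder` used here yields the same single 0/1 column that one-hot encoding with one level dropped would produce.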

In [39]:
# Import the Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [40]:
# Create a sample dataset
data = {
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'contract_type': ['Monthly', 'Yearly', 'Monthly', 'Monthly', 'Yearly'],
    'age': [25, 30, 22, 35, 28],
    'monthly_charges': [50, 80, 60, 75, 65],
    'tenure': [12, 24, 6, 18, 10]
}

df = pd.DataFrame(data)
df

| | gender | contract_type | age | monthly_charges | tenure |
|---|---|---|---|---|---|
| 0 | Male | Monthly | 25 | 50 | 12 |
| 1 | Female | Yearly | 30 | 80 | 24 |
| 2 | Male | Monthly | 22 | 60 | 6 |
| 3 | Female | Monthly | 35 | 75 | 18 |
| 4 | Male | Yearly | 28 | 65 | 10 |


In [41]:
# Identify the categorical features
categorical_features = ['gender', 'contract_type']
categorical_features

['gender', 'contract_type']

In [51]:
# Creating an instance of LabelEncoder
encoder = LabelEncoder()

In [52]:
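# Encode each categorical column in place. The encoder is refit per column,
# so after the loop it only remembers the classes of the last column.
for feature in categorical_features:
    df[feature] = encoder.fit_transform(df[feature])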
df

| | gender | contract_type | age | monthly_charges | tenure |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 25 | 50 | 12 |
| 1 | 0 | 1 | 30 | 80 | 24 |
| 2 | 1 | 0 | 22 | 60 | 6 |
| 3 | 0 | 0 | 35 | 75 | 18 |
| 4 | 1 | 1 | 28 | 65 | 10 |
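
For comparison, here is a sketch of the one-hot variant using `pandas.get_dummies`. It rebuilds the same sample data under a new name (`raw`, chosen here for illustration) because `df` was overwritten in place above. With `drop_first=True`, each binary feature collapses to a single indicator column.

```python
import pandas as pd

# Same sample data as above, rebuilt so the original labels are available
raw = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'contract_type': ['Monthly', 'Yearly', 'Monthly', 'Monthly', 'Yearly'],
    'age': [25, 30, 22, 35, 28],
    'monthly_charges': [50, 80, 60, 75, 65],
    'tenure': [12, 24, 6, 18, 10]
})

# One-hot encode the categorical columns, dropping the first level of each
one_hot_df = pd.get_dummies(
    raw,
    columns=['gender', 'contract_type'],
    drop_first=True,
    dtype=int,
)

print(one_hot_df)
```

The resulting `gender_Male` and `contract_type_Yearly` columns carry the same 0/1 values as the label-encoded columns, but their names record which category the 1 stands for. If contract type had three or more levels, one-hot encoding would add one column per extra level instead of assigning the arbitrary codes 0, 1, 2.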
