In [None]:
# Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting categorical data into 
numerical values that machine learning models can process. 
Categorical data includes variables such as gender, color, and 
occupation, which are not numerical in nature but can be 
represented using numerical values.

Encoding is useful in data science because most machine learning 
models cannot work with categorical data directly. Once the 
categories are encoded as numbers, these variables can be used
as input features for the models. Several encoding techniques 
are available, including one-hot encoding, label encoding, 
binary encoding, and ordinal encoding.

Overall, data encoding is an important preprocessing step in data
science, as it enables machine learning models to use categorical
data in the prediction process.
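
As a rough illustration of two of the techniques named above, the sketch below encodes a made-up `color` column with label encoding and with one-hot encoding using pandas. The dataframe and column names are assumptions chosen for the example, not part of the question.

```python
import pandas as pd

# Hypothetical toy data: a single categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer code per category.
# pandas assigns codes in sorted category order (blue=0, green=1, red=2).
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

Label encoding keeps a single column, while one-hot encoding adds one column per category and marks exactly one of them per row.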



In [None]:
# Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding is a type of data encoding used in machine learning
to represent categorical variables as numerical values.
In nominal encoding, each category is assigned a unique numerical 
value. This type of encoding is used when there is no inherent
ordering or ranking between the categories.

For example, suppose we have a dataset of customer reviews 
for a restaurant, and one of the features is the type of cuisine.
The cuisine types may include Italian, Mexican, Japanese, and Indian.
We can use nominal encoding to represent these categories as 
numerical values, such as:

Italian: 1
Mexican: 2
Japanese: 3
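Indian: 4

This allows us to use these categorical variables in machine 
learning models, as most machine learning algorithms work with
numerical values as inputs. The integers are arbitrary labels, 
so their magnitude carries no meaning. A model that treats its 
inputs as quantities (a linear model, for example) may read a 
false order into them, which is why one-hot encoding is often 
used for nominal features instead.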
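
The mapping above can be applied with a plain dictionary lookup. This is a minimal sketch: the codes come from the example, while the list of reviews is made-up sample data.

```python
# Codes taken from the cuisine example; the numbers are labels, not ranks.
cuisine_codes = {"Italian": 1, "Mexican": 2, "Japanese": 3, "Indian": 4}

# Hypothetical cuisine values from five customer reviews.
review_cuisines = ["Mexican", "Italian", "Indian", "Mexican", "Japanese"]

# Replace each category with its code.
encoded = [cuisine_codes[c] for c in review_cuisines]
print(encoded)  # [2, 1, 4, 2, 3]

# Keeping the inverse mapping makes it possible to decode predictions later.
code_to_cuisine = {code: name for name, code in cuisine_codes.items()}
decoded = [code_to_cuisine[code] for code in encoded]
```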



In [None]:
# Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Nominal encoding is preferred over one-hot encoding when the 
number of categories in a categorical variable is very high. 
One-hot encoding then creates one column per category, which makes 
the dataset wide and sparse and can lead to the curse of 
dimensionality. In such cases, nominal encoding can be a better 
choice since it keeps the number of columns small while still 
capturing the essence of the categorical variable.

An example of when nominal encoding might be preferred over 
one-hot encoding is a dataset that contains a categorical 
variable representing the country of origin of a product. 
If the dataset contains hundreds of unique country values, 
one-hot encoding would lead to an excessive number of columns. 
In this scenario, the countries can first be grouped into broad 
regions, such as Europe, Asia, North America, and so on, and 
the region can then be encoded. This captures coarse information 
about the country of origin with only a handful of values, at 
the cost of losing country-level detail.
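
One way to sketch the country example is to map each country to a region before encoding. The lookup table, the sample rows, and the `Other` fallback below are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical lookup table; a real project would need a complete mapping.
country_to_region = {
    "Germany": "Europe",
    "France": "Europe",
    "Japan": "Asia",
    "India": "Asia",
    "Canada": "North America",
    "Mexico": "North America",
}

# Made-up product rows.
products = pd.DataFrame(
    {"country": ["Germany", "Japan", "Canada", "France", "India", "Brazil"]}
)

# Collapse countries into regions; unmapped countries fall back to "Other".
products["region"] = products["country"].map(country_to_region).fillna("Other")

# The region column has few unique values, so encoding it stays compact.
region_dummies = pd.get_dummies(products["region"], dtype=int)
print(region_dummies.columns.tolist())
```

The number of encoded columns now depends on the number of regions rather than the number of countries.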



In [None]:
# Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
# technique would you use to transform this data into a format suitable for machine learning algorithms?
# Explain why you made this choice.

The choice of encoding technique depends on the nature of the 
data and the machine learning algorithm used. 
If the categorical data has a natural ordering, 
such as low, medium, and high, ordinal encoding can be used. 
If the categorical data has no natural ordering, 
one-hot encoding is typically used. 
With only 5 unique values, one-hot encoding is a suitable
choice: it adds just 5 binary features, so the dataset does 
not become excessively wide. One-hot encoding creates a binary 
feature for each category and does not impose an artificial 
order on the categories.
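
A minimal sketch of one-hot encoding a feature with 5 unique values, using scikit-learn's `OneHotEncoder`. The `payment` values are invented for the example; the question does not say what the 5 categories are.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature with exactly 5 unique, unordered values.
payment = np.array(
    [["card"], ["cash"], ["voucher"], ["transfer"], ["wallet"], ["cash"]]
)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
one_hot = encoder.fit_transform(payment).toarray()

print(encoder.categories_[0])  # the learned categories, in sorted order
print(one_hot.shape)           # (6, 5): one column per unique value
```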

In [None]:
# Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
# are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
# transform the categorical data, how many new columns would be created? Show your calculations.

If we use nominal encoding in the one-hot sense, we create 
one new column for each unique value in each categorical column. 
The question does not state how many unique values there are, 
so the figures below are assumptions.

Let's assume that the first categorical column has 4 
unique values, and the second categorical column has 5 
unique values. Then, the number of new columns created would be:

4 (for the first categorical column) + 5 (for the second categorical column) = 9

The three numerical columns are unchanged, so they are not new. 
After dropping the two original categorical columns, the encoded 
dataset has 9 + 3 = 12 columns in total, and still 1000 rows.
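
The arithmetic can be checked with a short sketch. The cardinalities 4 and 5 are the assumed values from the text; the column names are placeholders. Counting only the indicator columns as new gives 9, and 12 is the total width once the three untouched numerical columns are included.

```python
# Assumed number of unique values per categorical column (from the text).
unique_values = {"cat_a": 4, "cat_b": 5}
n_numerical = 3  # numerical columns pass through unchanged

# One indicator column per unique value.
new_columns = sum(unique_values.values())

# Original categorical columns are dropped after encoding.
total_columns = new_columns + n_numerical

print(new_columns, total_columns)  # 9 12
```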



In [None]:
# Q6. You are working with a dataset containing information about different types of animals, including their
# species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
# a format suitable for machine learning algorithms? Justify your answer.

The choice of encoding technique to transform categorical data 
into a format suitable for machine learning algorithms depends
on the nature of the data and the problem at hand. 
In the case of the animal dataset, nominal encoding might be a 
suitable choice if the categorical features are not ordinal
and do not have any inherent order or hierarchy. 
For example, the species of an animal is not inherently
hierarchical, and nominal encoding could be used to transform
this categorical feature into a format that can be used in
machine learning algorithms.

On the other hand, if the categorical data has an inherent order 
or hierarchy, such as low, medium, and high, then 
ordinal encoding might be a more appropriate choice. 
One-hot encoding works well when a feature has only a few 
unique values. When a feature has many unique values, such as 
species in a large dataset, one-hot encoding produces a very 
wide, sparse table, so a more compact encoding is preferable. 
Therefore, the choice of encoding technique depends on the 
specific characteristics of the data and the problem at hand.
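
For a small animal dataset with unordered features, a sketch with `pandas.get_dummies` could look like the following. The four rows are invented sample data, and one-hot encoding is used here only because each column has few unique values.

```python
import pandas as pd

# Made-up sample rows; a real dataset would have many more species.
animals = pd.DataFrame({
    "species": ["lion", "penguin", "zebra", "shark"],
    "habitat": ["savanna", "polar", "savanna", "ocean"],
    "diet": ["carnivore", "carnivore", "herbivore", "carnivore"],
})

# Encode every categorical column; the originals are replaced by indicators.
encoded = pd.get_dummies(animals, columns=["species", "habitat", "diet"], dtype=int)

# 4 species + 3 habitats + 2 diets = 9 indicator columns.
print(encoded.shape)  # (4, 9)
```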



In [11]:
# Q7. You are working on a project that involves predicting customer churn for a telecommunications
# company. You have a dataset with 5 features, including the customer's gender, age, contract type,
# monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
# data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Step 1: generate sample data for 1000 customers
gender = np.random.choice(['Male', 'Female'], size=1000)
contract_type = np.random.choice(['Month-to-month', 'One year', 'Two year'], size=1000)
age = np.random.normal(loc=40, scale=10, size=1000).astype(int)
monthly_charges = np.random.normal(loc=70, scale=20, size=1000)
tenure = np.random.randint(low=0, high=72, size=1000)

# Step 2: build the dataframe
data = pd.DataFrame({
    'Gender': gender,
    'ContractType': contract_type,
    'Age': age,
    'MonthlyCharges': monthly_charges,
    'Tenure': tenure
})

# Step 3: gender has only two values, so map it to a single 0/1 column
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Step 4: one-hot encode contract type (three unordered categories)
encoder = OneHotEncoder()
con_type = encoder.fit_transform(data[['ContractType']]).toarray()

# Step 5: append the encoded columns; the hand-written names must follow the
# encoder's sorted category order (Month-to-month, One year, Two year)
encoded_features = pd.DataFrame(con_type, columns=['Month_to_Month', 'One year', 'Two_year'])
data = pd.concat([data, encoded_features], axis=1)
data

,Gender,ContractType,Age,MonthlyCharges,Tenure,Month_to_Month,One year,Two_year
0,0,Month-to-month,43,50.323149,47,1.0,0.0,0.0
1,1,Two year,41,22.954624,0,0.0,0.0,1.0
2,1,Two year,39,69.116844,43,0.0,0.0,1.0
3,0,Month-to-month,26,61.906450,71,1.0,0.0,0.0
4,1,Month-to-month,39,95.959852,40,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...
995,1,One year,29,81.387916,50,0.0,1.0,0.0
996,0,Two year,35,87.078197,0,0.0,0.0,1.0
997,0,Two year,69,92.387925,38,0.0,0.0,1.0
998,1,Two year,43,83.765868,18,0.0,0.0,1.0
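
In the cell above, the one-hot column names are typed by hand, which works only while they match the encoder's category order. As a possible alternative, the sketch below derives the names from the fitted encoder's `categories_` attribute and drops the original text column. The four-row `sample` frame and the `ContractType_` prefix are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small stand-in for the churn data; values chosen for illustration.
sample = pd.DataFrame(
    {"ContractType": ["Two year", "Month-to-month", "One year", "Two year"]}
)

encoder = OneHotEncoder()
matrix = encoder.fit_transform(sample[["ContractType"]]).toarray()

# Build column names from the categories the encoder actually learned.
columns = [f"ContractType_{c}" for c in encoder.categories_[0]]
encoded = pd.DataFrame(matrix, columns=columns, index=sample.index)

# Replace the original text column with its indicator columns.
result = pd.concat([sample.drop(columns="ContractType"), encoded], axis=1)
print(result)
```

Deriving the names this way keeps labels and values aligned even if the set of contract types changes.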
