In [None]:
Data encoding is the process of converting data from one form to another, often to facilitate storage,
processing, or transmission. In data science, encoding is useful for several reasons:

Categorical Data: Encoding categorical variables into numerical values allows machine learning models to
interpret them correctly. Common techniques include one-hot encoding and label encoding.

Text Data: Encoding text into numerical representations (e.g., word embeddings) enables the use of text
data in machine learning models.

Compression: Encoding can reduce the size of data, making it more efficient to store and transmit.

In [None]:
Nominal encoding is a type of encoding used for categorical variables where the 
categories have no intrinsic order or ranking. Its most common form is one-hot encoding, and the
two terms are often used interchangeably.

In one-hot encoding, each category is represented by a binary vector, where each element in the vector 
corresponds to a category, and only one element is 1 (indicating the presence of that category) while the
rest are 0.


Let's say you have a dataset of cars, and one of the categorical variables is "color" with categories
"red," "blue," and "green." After nominal encoding, the "color" variable would be represented as three
binary variables: "color_red," "color_blue," and "color_green." For a red car, the "color_red" variable
would be 1 and the others 0. For a blue car, "color_blue" would be 1 and the others 0, and so on.
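
The car color example can be sketched with pandas. This is a minimal illustration, not a definitive
implementation: the `cars` DataFrame and its four rows are made-up sample data, and `pd.get_dummies`
is one of several ways to produce one-hot columns.

```python
import pandas as pd

# Hypothetical toy data; only the "color" feature and its categories come from the text.
cars = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# Create one binary column per category, named "color_<category>".
# dtype=int gives 0/1 values instead of True/False.
encoded = pd.get_dummies(cars, columns=["color"], dtype=int)

print(encoded)
```

Each row ends up with exactly one 1, in the column that matches its original color.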

In [None]:
Nominal encoding and one-hot encoding are often treated as the same thing. Strictly speaking, one-hot
encoding is the most common technique for nominal encoding, that is, for encoding categorical variables
where the categories have no intrinsic order or ranking.

In situations where you have categorical variables without any ordinal relationship (like colors,
types of cars, etc.), you would typically use one-hot encoding, which creates one binary column
per category.

For example, in a dataset of cars where one of the features is "color" with categories "red," "blue," and
"green," you would use one-hot encoding to represent each color as a binary variable
(e.g., "color_red," "color_blue," "color_green") with values of 0 or 1.

In [None]:
If the dataset contains categorical data with 5 unique values and these values do 
not have an inherent ordinal relationship (i.e., they are nominal), I would use one-hot encoding to
transform this data into a format suitable for machine learning algorithms.

One-hot encoding would create 5 binary columns, each representing one of the unique values. Each row
in the dataset would have a 1 in the column corresponding to its value and 0s in the other columns. 
This encoding ensures that the machine learning algorithm does not interpret the categorical values as
having any ordinal relationship, so it cannot learn a spurious ranking among the 5 values.
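
To make the mechanics concrete, here is a small pure-Python sketch of one-hot encoding for a column
with 5 unique values. The helper `one_hot` and the values "A" to "E" are hypothetical names chosen
for illustration; in practice a library routine would normally do this.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 list containing a single 1."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[index[value]] = 1         # mark the column for this value
        rows.append(row)
    return categories, rows


# Hypothetical nominal column with 5 unique values.
data = ["A", "C", "E", "B", "D", "A"]
categories, rows = one_hot(data)

print(categories)
for value, row in zip(data, rows):
    print(value, row)
```

Because every category gets its own column, no numeric distance or order between categories is implied.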

In [None]:
If you were to use nominal encoding (specifically one-hot encoding) to transform the two categorical 
columns in the dataset, you would create new binary columns for each unique value in each categorical column.

Let's say the first categorical column has 4 unique values and the second categorical column has 3 
unique values.

For the first categorical column, you would create 4 new binary columns (one for each unique value), 
and for the second categorical column, you would create 3 new binary columns.

Therefore, the total number of new columns created would be 4 (from the first categorical column) + 3 
(from the second categorical column) = 7 new columns.
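
The column count can be checked with a short pandas sketch. The column names `cat_a` and `cat_b` and
their values are assumptions made up for this example; only the counts of 4 and 3 unique values come
from the text above.

```python
import pandas as pd

# Hypothetical data: cat_a has 4 unique values, cat_b has 3.
df = pd.DataFrame({
    "cat_a": ["a", "b", "c", "d", "a", "b"],
    "cat_b": ["x", "y", "z", "x", "y", "z"],
})

# One-hot encode both columns; the originals are replaced by the binary columns.
encoded = pd.get_dummies(df, columns=["cat_a", "cat_b"], dtype=int)

# The number of new columns equals the sum of unique values per column.
expected = df["cat_a"].nunique() + df["cat_b"].nunique()
print(encoded.shape[1], expected)
```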

In [None]:
To transform the categorical data about different types of animals into a format 
suitable for machine learning algorithms, I would use a combination of label encoding and one-hot encoding,
depending on the nature of the categorical variables.

Label Encoding: If the categorical variables have an ordinal relationship (e.g., small, medium, large),
I would use label encoding with an explicit mapping that follows that order (e.g., small = 0,
medium = 1, large = 2), sometimes called ordinal encoding. This allows the algorithm to understand
the order or hierarchy within the categories.

One-Hot Encoding: For categorical variables without an inherent order or hierarchy (e.g., species, habitat),
I would use one-hot encoding. One-hot encoding creates binary columns for each unique category, which helps
prevent the algorithm from interpreting these categories as having any ordinal relationship.

For example, if there is a categorical variable "species" with categories like "lion," "tiger," and 
"elephant," one-hot encoding would create three binary columns (one for each category) where each row 
corresponds to a different animal, and the value in each column indicates whether that animal belongs 
to that species.
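
Both steps can be sketched together in pandas. The `animals` DataFrame, the sizes assigned to each
species, and the `size_order` mapping are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd

# Hypothetical sample data; the sizes assigned here are illustrative only.
animals = pd.DataFrame({
    "species": ["lion", "tiger", "elephant"],
    "size": ["medium", "medium", "large"],
})

# Ordinal feature: map each label to an integer that preserves the order.
size_order = {"small": 0, "medium": 1, "large": 2}
animals["size"] = animals["size"].map(size_order)

# Nominal feature: one binary column per species.
encoded = pd.get_dummies(animals, columns=["species"], dtype=int)

print(encoded)
```

Writing the mapping out by hand keeps the intended order explicit, rather than relying on whatever
order an automatic encoder would assign.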

In [None]:
Label Encoding: Use label encoding to convert the 'gender' feature into numerical
values. For example, 'male' could be encoded as 0 and 'female' as 1.

One-Hot Encoding: Use one-hot encoding to convert the 'contract type' feature into binary columns. 
For example, if the 'contract type' feature has categories 'month-to-month', 'one year', and 'two year', 
create three new binary columns: 'contract_month-to-month', 'contract_one year', 'contract_two year'.
Set the value to 1 in the corresponding column for each customer contract type, and 0 in the other 
columns.

After encoding, the dataset would have numerical values for all features, including the transformed 
'gender' and 'contract type' features, which can be used for predicting customer churn using machine 
learning algorithms.
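
A minimal pandas sketch of these two steps follows. The `customers` DataFrame and its three rows are
invented sample data; the feature names, categories, and the 0/1 gender mapping follow the example
in the text.

```python
import pandas as pd

# Hypothetical customer records for illustration.
customers = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "contract type": ["month-to-month", "two year", "one year"],
})

# Label encoding for the binary 'gender' feature.
customers["gender"] = customers["gender"].map({"male": 0, "female": 1})

# One-hot encoding for 'contract type', using the "contract" prefix for the new columns.
encoded = pd.get_dummies(
    customers, columns=["contract type"], prefix="contract", dtype=int
)

print(encoded)
```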