### Feature Engineering 4

### Question 1

Q1. What is data encoding? How is it useful in data science?

__Answer__

Data encoding is the process of converting data from one format to another. In the context of data science, it often refers to the process of converting categorical data into numerical data that can be used as input for machine learning algorithms.

For more context, categorical data is data that can be divided into categories or groups, such as gender (male or female), color (red, green, blue), or size (small, medium, large). Many machine learning algorithms require numerical input data, so it is necessary to encode categorical data into a numerical format before using it as input for these algorithms.

There are several common techniques for encoding categorical data, including:
1. one-hot encoding
2. label encoding
3. binary encoding

However, each technique has its own advantages and disadvantages and the choice of encoding method will depend on the specific use case.
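
As a rough illustration of the three techniques listed above, the sketch below encodes one toy column three ways with pandas. The `colors` data is hypothetical, and the binary encoding is hand-rolled for clarity (dedicated encoder libraries may number the codes differently).

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer code per category (codes follow sorted category order)
label_codes = colors.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

# Binary encoding: write each integer code in base 2, one column per bit
n_bits = max(1, int(label_codes.max()).bit_length())
binary = pd.DataFrame(
    {f"color_bit{bit}": (label_codes // 2**bit) % 2 for bit in range(n_bits)}
)

print(label_codes.tolist())
print(one_hot)
print(binary)
```

With three categories, one-hot encoding needs three columns while the binary form needs two; that gap widens as the number of categories grows.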

__Usefulness__

Data encoding is useful in data science because it allows us to convert categorical data into a format that can be used as input for machine learning algorithms. This enables us to include categorical features in our models and can improve the performance of our algorithms by providing additional information about the data.

### Question 2

Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

__Answer__

Nominal encoding, also known as one-hot encoding or dummy encoding, is a technique used in machine learning to represent categorical variables with binary values. In nominal encoding, each unique category of a categorical variable gets its own binary column, where 1 indicates that the row belongs to the category and 0 indicates that it does not.


__A real-world scenario:__

Suppose you are building a recommendation system for a movie streaming service. One of the features you want to include in your model is the genre of a movie, which can be a categorical variable with values such as "Action," "Comedy," "Drama," etc. Since machine learning models typically work with numerical data, you need to encode these categorical values into binary values using nominal encoding.

You can create binary columns for each genre category, and for each movie in your dataset, set the corresponding binary value to 1 if the movie belongs to that genre, and 0 otherwise. For example, if a movie is classified as an "Action" movie, the "Action" column will have a value of 1, and the columns for other genres will have values of 0. This way, the categorical variable "genre" is converted into a binary feature representation that can be used as input to your machine learning model.

Here's an example of what the nominal encoding might look like for a movie dataset:

| Movie | Genre | Action | Comedy | Drama |
| --- | --- | --- | --- | --- |
| Movie1 | Action | 1 | 0 | 0 |
| Movie2 | Comedy | 0 | 1 | 0 |
| Movie3 | Drama | 0 | 0 | 1 |
| Movie4 | Action | 1 | 0 | 0 |

__Further Explanation__

In this example, the "Genre" column is nominal encoded into three binary columns "Action," "Comedy," and "Drama," where each binary column represents the presence or absence of a specific genre for each movie in the dataset. This nominal encoding allows the machine learning model to understand and utilize the categorical variable "genre" as input for making recommendations. 

So, for instance, if you are building a recommendation model and a user prefers action movies, the model can leverage the binary "Action" column to make personalized recommendations accordingly. 
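
The movie example can be reproduced with a minimal pandas sketch. The four rows mirror the example dataset; the variable names (`movies`, `encoded`, `action_movies`) are only illustrative.

```python
import pandas as pd

# The four movies from the example dataset
movies = pd.DataFrame({
    "Movie": ["Movie1", "Movie2", "Movie3", "Movie4"],
    "Genre": ["Action", "Comedy", "Drama", "Action"],
})

# One binary column per genre, appended next to the original columns
encoded = pd.concat(
    [movies, pd.get_dummies(movies["Genre"], dtype=int)], axis=1
)
print(encoded)

# Pick the titles for a user who prefers action movies
action_movies = encoded.loc[encoded["Action"] == 1, "Movie"].tolist()
print(action_movies)
```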

Overall, nominal encoding is a common technique used to represent categorical variables in machine learning models, allowing them to work effectively with categorical data.

### Question 3

Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

__Answer__

__Nominal encoding__, taken here to mean assigning each category a single numeric code rather than one binary column per category, is preferred over one-hot encoding in situations where the categorical variables have a large number of unique values or levels. One practical example is when dealing with text data, such as words in a document or product names in an e-commerce dataset.

__Example__

For instance, consider a dataset of customer reviews for a product, where the review text contains multiple words. One-hot encoding would create a binary feature for each unique word in the text, resulting in a very high-dimensional and sparse dataset. This can lead to increased computational complexity and storage requirements, and may also cause issues with overfitting in machine learning models.

Nominal encoding can instead map each unique word to a numeric code or index, without creating a separate binary feature for each word. This reduces the dimensionality of the dataset and makes it more manageable for analysis and modeling. Note that the codes themselves are arbitrary: they identify a word but do not express any relationship between words, so they are best used with models that treat them as identifiers (for example, tree-based models or embedding layers) rather than as magnitudes.
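
The word-to-index mapping described above can be sketched with `pd.factorize`. The review words are hypothetical, and this is only one possible way to build such a mapping.

```python
import pandas as pd

# Hypothetical tokenised review text; illustrative only
words = pd.Series(["great", "battery", "great", "screen", "battery", "great"])

# factorize assigns integer codes in order of first appearance
codes, vocabulary = pd.factorize(words)
print(codes.tolist())
print(list(vocabulary))

# One-hot encoding needs one column per unique word; integer codes need one column
n_one_hot_columns = pd.get_dummies(words).shape[1]
print(n_one_hot_columns, "one-hot columns versus 1 code column")
```

With a real vocabulary of thousands of words, the one-hot column count grows with the vocabulary while the code column stays a single feature.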


### Question 4

Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

__Answer__

One-hot encoding will be my go-to choice for this scenario.

__Reason__

1. With only 5 unique values, one-hot encoding adds just 5 binary columns, so the dimensionality stays manageable and each category can contribute meaningfully to the machine learning algorithm.

2. It gives a straightforward and interpretable representation of the categorical data, where each category is treated as a separate binary feature. This allows the algorithm to learn a separate effect for each category.

3. It is supported by most machine learning libraries and algorithms, making it a convenient choice for data preprocessing.

4. It does not impose an artificial order on the categories, because each category is represented by its own binary feature. If the categories do have a natural order, ordinal encoding would preserve that order better.
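
A minimal sketch of this choice with scikit-learn's `OneHotEncoder`, assuming a hypothetical feature whose 5 unique values are the letters A to E:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature with 5 unique values (A to E); illustrative only
levels = np.array(["A", "B", "C", "D", "E", "A", "C"]).reshape(-1, 1)

# handle_unknown="ignore" encodes categories unseen during fit as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(levels).toarray()

print(encoder.categories_[0])
print(encoded.shape)
```

Five unique values give five binary columns, and every row has exactly one column set to 1.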



### Question 5

Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.


__Answer__

The question does not state how many unique categories each categorical column has, so assume that the first categorical column has 10 unique categories and the second has 5.

For the first categorical column with 10 unique categories, nominal encoding would create 10 new columns (one for each category) with binary values (0/1) representing the presence or absence of each category.

For the second categorical column with 5 unique categories, nominal encoding would create 5 new columns (one for each category) with binary values (0/1) representing the presence or absence of each category.

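Therefore, in total, nominal encoding would create 10 + 5 = 15 new columns in this scenario. Because the two original categorical columns are replaced by these binary columns, the dataset goes from 5 columns to 3 + 15 = 18 columns.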


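The code below checks the calculation on sample data. Here both categorical columns have 5 unique categories, so nominal encoding creates 5 + 5 = 10 new columns; because the two original categorical columns are replaced, the net increase in column count is 10 - 2 = 8.

In [1]:
import pandas as pd

# Generate sample data
data = {
    'categorical_col1': ['A', 'B', 'C', 'D', 'E'] * 200,  # 5 unique categories
    'categorical_col2': ['X', 'Y', 'Z', 'W', 'V'] * 200,  # 5 unique categories
    'numerical_col1': [1, 2, 3, 4, 5] * 200,
    'numerical_col2': [10, 20, 30, 40, 50] * 200,
    'numerical_col3': [100, 200, 300, 400, 500] * 200
}

df = pd.DataFrame(data)

# Perform nominal encoding (the original categorical columns are dropped)
df_nominal = pd.get_dummies(df, columns=['categorical_col1', 'categorical_col2'])

# Columns that exist only after encoding are the newly created ones
num_new_columns = len(set(df_nominal.columns) - set(df.columns))

# Net change in column count after the two original columns are replaced
net_increase = df_nominal.shape[1] - df.shape[1]

print("Number of new columns created by nominal encoding:", num_new_columns)
print("Net increase in column count:", net_increase)


Number of new columns created by nominal encoding: 10
Net increase in column count: 8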


### Question 6

Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

__Answer__

I will use __One-Hot Encoding Technique__ because:

1. The features (species, habitat, and diet) are categorical variables that have no ranking or natural order.

2. Each animal belongs to only one category within each feature, which makes one-hot encoding a good fit.

Moreover, the number of categories within each of these features is expected to be small (around 5 each), so one-hot encoding will not create too many columns.
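
A short sketch of this approach, using a hypothetical four-row `animals` table (the species, habitats, and diets are made up for illustration). It also shows that the number of one-hot columns equals the sum of the unique values per feature.

```python
import pandas as pd

# Hypothetical animal records; illustrative values only
animals = pd.DataFrame({
    "species": ["lion", "eagle", "shark", "deer"],
    "habitat": ["savanna", "mountain", "ocean", "forest"],
    "diet": ["carnivore", "carnivore", "carnivore", "herbivore"],
})

categorical_cols = ["species", "habitat", "diet"]

# Expected number of one-hot columns: sum of unique values across the features
expected_columns = int(animals[categorical_cols].nunique().sum())

# One-hot encode all three features at once
encoded = pd.get_dummies(animals, columns=categorical_cols, dtype=int)

print(expected_columns, encoded.shape[1])
print(encoded.columns.tolist())
```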


### Question 7

Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

__Answer__

In this dataset, the categorical variables are gender and contract type.

I will use two encoding techniques which are:

1. Label encoding
2. One hot encoding

__Step-by-step explanation__

1. Contract type: identify the levels of this feature. Hypothetically, we will have (a) Month, (b) Year, and (c) Two year. Because these levels have a natural order by contract length, they can be label encoded as 0, 1, and 2 respectively.


2. Gender: one-hot encoding will be used for this feature, which has 2 categories (Male and Female). In this case, a new feature would be created for Male and another for Female. If a customer is male, the Male feature would be 1 and the Female feature would be 0. If a customer is female, the Female feature would be 1 and the Male feature would be 0.



In [None]:
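### label encoding

import pandas as pd

# Hypothetical sample standing in for the churn dataset (illustrative values only)
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Contract type': ['Month', 'Two year', 'Year'],
})

# LabelEncoder would assign codes alphabetically (Month=0, Two year=1, Year=2),
# so map the levels explicitly to keep the intended order
contract_codes = {'Month': 0, 'Year': 1, 'Two year': 2}

# transform the 'Contract type' feature
data['Contract type'] = data['Contract type'].map(contract_codes)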

In [None]:
### for one_hot encoding

import pandas as pd

# apply One-Hot Encoding to the 'Gender' feature (dtype=int gives 0/1 rather than True/False)
gender_onehot = pd.get_dummies(data['Gender'], prefix='Gender', dtype=int)

# concatenate the new binary features with the original dataset
data = pd.concat([data, gender_onehot], axis=1)

# drop the original 'Gender' feature from the dataset
data = data.drop(['Gender'], axis=1)

### The End