##### Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data from one form to another, usually for the purpose of transmission, storage, or analysis. Data decoding is the reverse process of converting data back to its original form, usually for the purpose of interpretation or use.

In data science, data encoding and decoding play a crucial role as they act as a bridge between raw data and actionable insights. They enable us to:

- Prepare data for analysis by transforming it into a suitable format that can be processed by algorithms or models.
- Engineer features by extracting relevant information from data and creating new variables that can improve the performance or accuracy of analysis.
- Compress data by reducing its size or complexity without losing its essential information or quality.
- Protect data by encrypting it or masking it to prevent unauthorized access or disclosure.
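
As a minimal illustration of the encode/decode round trip described above, the following sketch maps category names to integers and back using plain Python dictionaries. The `sizes` list and all variable names are hypothetical, chosen only for this example.

```python
# Hypothetical categorical data (not taken from any real dataset)
sizes = ["small", "large", "medium", "small"]

# Encoding: map each distinct category to an integer code
categories = sorted(set(sizes))  # ['large', 'medium', 'small']
to_code = {cat: code for code, cat in enumerate(categories)}
encoded = [to_code[s] for s in sizes]

# Decoding: invert the mapping to recover the original values
to_category = {code: cat for cat, code in to_code.items()}
decoded = [to_category[c] for c in encoded]

print(encoded)           # [2, 0, 1, 2]
print(decoded == sizes)  # True
```

Keeping the mapping around is what makes decoding possible; without it the integers cannot be interpreted.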

##### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding is a technique for converting categorical data into numerical data when the categories have no inherent order or rank.
Typical nominal features are plain labels, such as the city a person lives in, gender, or marital status.

Real-world scenario: Suppose we have a dataset of customer feedback on a product. One of the features is “Sentiment”, with three categories: Positive, Negative, and Neutral. We can use LabelEncoder to convert these categories into integers so that a model can learn from them and predict the sentiment of future customer feedback.

In [1]:
from sklearn.preprocessing import LabelEncoder

# Create a sample dataset
sentiments = ['Positive', 'Negative', 'Neutral', 'Positive', 'Neutral']

# Initialize LabelEncoder
le = LabelEncoder()

# Fit and transform the data
encoded_sentiments = le.fit_transform(sentiments)

# Print the encoded data
print(encoded_sentiments)


[2 0 1 2 1]
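
To see which integer stands for which category, the fitted encoder can be inspected. The sketch below repeats the same sentiment list and uses `classes_` and `inverse_transform` from scikit-learn's `LabelEncoder`; the `mapping` and `decoded` names are introduced here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

sentiments = ['Positive', 'Negative', 'Neutral', 'Positive', 'Neutral']

le = LabelEncoder()
encoded_sentiments = le.fit_transform(sentiments)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}

# inverse_transform converts the integer codes back to the original labels
decoded = le.inverse_transform(encoded_sentiments).tolist()
print(decoded == sentiments)  # True
```

The codes follow the alphabetical order of the labels, not any notion of sentiment strength.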


##### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

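Nominal encoding may be preferred over one-hot encoding when the categorical variable has a large number of unique categories and we want to limit dimensionality.
It produces a single column of integers, whereas one-hot encoding adds one column per category. The trade-off is that the integers carry an arbitrary order, so this approach tends to work best with models that do not depend on the numeric distance between codes, such as tree-based models.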

In [2]:
# Here's a practical example
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Yellow', 'Green']}
df = pd.DataFrame(data)

label_encoder = LabelEncoder()
df['Color_Encoded'] = label_encoder.fit_transform(df['Color'])

print(df)

    Color  Color_Encoded
0     Red              2
1    Blue              0
2   Green              1
3     Red              2
4  Yellow              3
5   Green              1
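
The dimensionality argument can be checked with a rough sketch. It assumes a hypothetical `City` feature with 50 distinct values and compares the shape of an integer-coded column with the shape of a one-hot encoded frame, using only pandas.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 50 distinct city labels, 200 rows
cities = [f"City_{i % 50}" for i in range(200)]
df_city = pd.DataFrame({"City": cities})

# Integer codes: always a single column, whatever the number of categories
codes = df_city["City"].astype("category").cat.codes

# One-hot encoding: one column per distinct category
one_hot = pd.get_dummies(df_city["City"])

print(codes.to_frame().shape)  # (200, 1)
print(one_hot.shape)           # (200, 50)
```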


##### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

**Answer:**
The choice of encoding technique for transforming categorical data with 5 unique values depends on the nature of the data and the requirements of the machine learning algorithm. Here are two common encoding techniques and considerations for each:

**Nominal Encoding (Label Encoding):**  
_Description_: Nominal encoding assigns a unique integer to each category. This is suitable when there is no inherent order or ranking among the categories.  
_Example_: Using Label Encoding to map the 5 unique values to integers 0 through 4.

**One-Hot Encoding:**  
_Description_: One-hot encoding creates binary columns for each category, where each column indicates the presence or absence of a particular category. This is suitable when there is no ordinal relationship among the categories, and we want to prevent the model from assuming any ordinality.  
_Example_: Using One-Hot Encoding to represent the 5 unique values with binary columns.

**Choice and Considerations:**  
If the 5 values are distinct, unordered categories (e.g., colors, types, labels), either technique applies, and the choice depends on the model. **Nominal Encoding (Label Encoding)** is simpler and results in a single column of integers, but those integers imply an order that linear or distance-based models may misinterpret.

With only 5 unique values, **One-Hot Encoding** adds just 5 binary columns, so the extra dimensionality is small. It keeps the categories independent and ensures that the model doesn't interpret any ordinal relationship between them, which usually makes it the safer choice here.
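
The concern about implied order can be made concrete with a small sketch in plain Python. The five color names are hypothetical; the point is only that integer codes make some categories look closer together than others, while one-hot vectors keep every pair equally far apart.

```python
import math

# Five hypothetical unordered categories
categories = ["Blue", "Green", "Orange", "Red", "Yellow"]

# Label encoding: one integer per category
label_code = {cat: i for i, cat in enumerate(categories)}

# One-hot encoding: one binary vector per category
one_hot = {
    cat: [1 if j == i else 0 for j in range(len(categories))]
    for i, cat in enumerate(categories)
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# With integer codes, some pairs look "closer" than others
print(abs(label_code["Blue"] - label_code["Green"]))   # 1
print(abs(label_code["Blue"] - label_code["Yellow"]))  # 4

# With one-hot vectors, every pair of distinct categories is sqrt(2) apart
print(euclidean(one_hot["Blue"], one_hot["Green"]))
print(euclidean(one_hot["Blue"], one_hot["Yellow"]))
```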

##### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

**Answer**  
If we use nominal encoding (label encoding) on the two categorical columns, each one produces a single new column holding the integer codes of its categories. The number of rows (1000) and the three numerical columns do not affect the count.

Calculation: new columns = number of categorical columns × 1 = 2 × 1 = 2. If the encoded columns are added alongside the originals, the dataset grows from 5 to 7 columns; if they replace the originals, it stays at 5.

So, the answer is **2 new columns**.
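
The count can be verified on a stand-in dataset. The sketch below builds a hypothetical 1000-row frame with two categorical and three numerical columns (all column names and values are made up) and counts the columns added by integer encoding.

```python
import pandas as pd

# Hypothetical dataset: 1000 rows, 2 categorical and 3 numerical columns
n_rows = 1000
df_demo = pd.DataFrame({
    "cat_a": ["x", "y", "z", "y"] * (n_rows // 4),
    "cat_b": ["p", "q"] * (n_rows // 2),
    "num_1": range(n_rows),
    "num_2": [0.5] * n_rows,
    "num_3": [1] * n_rows,
})

categorical_cols = ["cat_a", "cat_b"]
n_before = df_demo.shape[1]

# Integer-encode each categorical column into exactly one new column
for col in categorical_cols:
    df_demo[col + "_encoded"] = df_demo[col].astype("category").cat.codes

n_new = df_demo.shape[1] - n_before
print(n_new)          # 2
print(df_demo.shape)  # (1000, 7)
```

The result does not depend on how many distinct values each categorical column contains, which is the main difference from one-hot encoding.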

##### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.


**Answer**:
Species, habitat, and diet are all nominal variables: their categories are distinct and have no inherent order or ranking. The choice of encoding technique therefore depends on the number of categories and the requirements of the machine learning algorithm:

- **Nominal Encoding (Label Encoding)** gives a simple, compact representation (one integer column per variable). It is a reasonable choice when a variable such as species has many categories, or when the model does not rely on the numeric order of the codes.
- **One-Hot Encoding** gives an explicit representation with one binary column per category and keeps the categories independent. It suits variables with few categories, such as diet.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'Species': ['Lion', 'Tiger', 'Giraffe', 'Elephant'],
        'Habitat': ['Jungle', 'Savannah', 'Zoo', 'Forest'],
        'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Herbivore']}

df = pd.DataFrame(data)

# Nominal encoding: fit a separate LabelEncoder per column so that each
# mapping is kept and can be inverted later with inverse_transform
encoders = {}
for col in ['Species', 'Habitat', 'Diet']:
    encoders[col] = LabelEncoder()
    df[col + '_Encoded'] = encoders[col].fit_transform(df[col])

df

    Species   Habitat       Diet  Species_Encoded  Habitat_Encoded  Diet_Encoded
0      Lion    Jungle  Carnivore                2                1             0
1     Tiger  Savannah  Carnivore                3                2             0
2   Giraffe       Zoo  Herbivore                1                3             1
3  Elephant    Forest  Herbivore                0                0             1
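
As an alternative sketch, scikit-learn's `OrdinalEncoder` can integer-encode several feature columns in one call. The scikit-learn documentation describes `LabelEncoder` as intended for target labels rather than input features. Despite its name, `OrdinalEncoder` with default settings assigns codes by sorted category order, so the result below matches the table above. The `animals` and `encoded` names are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

animals = pd.DataFrame({
    'Species': ['Lion', 'Tiger', 'Giraffe', 'Elephant'],
    'Habitat': ['Jungle', 'Savannah', 'Zoo', 'Forest'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Herbivore'],
})

# One encoder handles all feature columns and stores one category list per column
ordinal_encoder = OrdinalEncoder()
codes = ordinal_encoder.fit_transform(animals).astype(int)

encoded = pd.DataFrame(codes, columns=[c + '_Encoded' for c in animals.columns])
print(encoded)

# categories_[i] holds the sorted categories of the i-th input column
print(ordinal_encoder.categories_[2])
```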


##### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

**Answer**  
In the context of predicting customer churn for a telecommunications company with a dataset containing categorical features such as gender and contract type, we would typically use encoding techniques to convert these categorical features into numerical format. Here's a step-by-step explanation of how we might implement the encoding:

**Categorical features to encode** (age, monthly charges, and tenure are already numerical):
1. Gender
2. Contract type

**Encoding Techniques:**
1. **Nominal Encoding (Label Encoding) for Binary Categorical Features:**
- For the "gender" feature (assuming it has two categories, e.g., 'Male' and 'Female'), we can use nominal encoding (label encoding): with only two categories, a single 0/1 column introduces no misleading order.
- Step-by-Step Implementation:

In [4]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = {
    'gender': ['Male', 'Female', 'Male'],
    'ContractType': ['Month-to-Month', 'One_Year', 'Two_Year']
}
df1 = pd.DataFrame(data)

# 'gender' is a binary categorical column, so a single 0/1 column is enough
label_encoder = LabelEncoder()
df1['Gender_Encoded'] = label_encoder.fit_transform(df1['gender'])

df1


   gender    ContractType  Gender_Encoded
0    Male  Month-to-Month               1
1  Female        One_Year               0
2    Male        Two_Year               1


2. **One-Hot Encoding for Non-Binary Categorical Features:**
- For the "contract type" feature (assuming it has more than two categories, e.g., 'Month-to-Month', 'One_Year', 'Two_Year'), we should use one-hot encoding, since integer codes would impose an arbitrary order on its categories.
- Step-by-Step Implementation:

In [5]:
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()

# One-hot encode 'ContractType'; fit_transform returns a sparse matrix by default
encoded = onehot_encoder.fit_transform(df1[['ContractType']]).toarray()
df2 = pd.DataFrame(encoded, columns=onehot_encoder.get_feature_names_out())

# Append the one-hot columns to the original DataFrame
df = pd.concat([df1, df2], axis=1)

df

   gender    ContractType  Gender_Encoded  ContractType_Month-to-Month  ContractType_One_Year  ContractType_Two_Year
0    Male  Month-to-Month               1                          1.0                    0.0                    0.0
1  Female        One_Year               0                          0.0                    1.0                    0.0
2    Male        Two_Year               1                          0.0                    0.0                    1.0


In this example, 'Gender_Encoded' holds the encoded values for the 'gender' feature, and one-hot encoding creates a separate binary column for each category in the 'ContractType' feature. This numerical representation allows us to use these features in machine learning models for predicting customer churn.