## Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data into a specific format for efficient storage, transmission, and processing. In data science, it is useful for transforming categorical data into numerical values, facilitating the use of machine learning algorithms that require numerical input, and improving the performance of data processing and analysis tasks.

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding, used here as another name for one-hot encoding, converts a categorical variable with no inherent order into a binary matrix. Each category is represented by its own binary vector, in which exactly one element is "1" and the rest are "0".

###### Real-World Application:
In a customer survey dataset, you have a "Preferred Contact Method" column with categories like "Email," "Phone," and "SMS." To use this data in a machine learning model, you would apply nominal encoding to transform these categories into a binary matrix with one column per contact method, allowing the model to process the data without assuming any order among the methods.
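The survey scenario can be sketched with pandas as follows. The sample responses and the `Contact` prefix are assumptions made for illustration; only the column name and its three categories come from the example.

```python
import pandas as pd

# Hypothetical survey responses; the column name follows the example above.
survey = pd.DataFrame(
    {"Preferred Contact Method": ["Email", "Phone", "SMS", "Email"]}
)

# One binary indicator column per category.
# dtype=int gives 0/1 values instead of booleans.
contact_encoded = pd.get_dummies(
    survey["Preferred Contact Method"], prefix="Contact", dtype=int
)

print(contact_encoded)
```

Each row contains exactly one "1", in the column of the method that respondent chose.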

## Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Nominal encoding and one-hot encoding refer to the same technique here, so neither is preferred over the other. The more useful comparison is between one-hot (nominal) encoding and other methods such as label encoding:

##### Situations where One-Hot Encoding is Preferred:
- Non-ordinal categorical data: When the categorical data does not have a natural order or ranking.
- Avoiding misleading relationships: Preventing the model from assuming any ordinal relationship between categories.
- Machine Learning Compatibility: Many machine learning algorithms perform better with, or require, one-hot encoded data for categorical features.
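As a practical illustration of these points, the sketch below encodes the same feature both ways. The `Color` feature and its values are hypothetical and chosen only to show the contrast.

```python
import pandas as pd

# Hypothetical nominal feature with no natural order.
colors = pd.Series(["Red", "Green", "Blue", "Green"], name="Color")

# Integer (label) codes: pandas numbers the categories alphabetically,
# so Blue=0, Green=1, Red=2. A model may read this as Blue < Green < Red.
label_codes = colors.astype("category").cat.codes

# One-hot encoding: each category gets its own 0/1 column,
# so no order or distance between categories is implied.
one_hot = pd.get_dummies(colors, dtype=int)

print(label_codes.tolist())
print(one_hot)
```

With the integer codes, "Red" appears to be twice as far from "Blue" as "Green" is, which has no meaning for colors. The one-hot columns carry no such ranking.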

## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

For a dataset containing categorical data with 5 unique values, I would use one-hot encoding. Here’s why:

Reason for Choosing One-Hot Encoding:

Avoids Ordinal Assumptions: One-hot encoding ensures that the machine learning model does not assume any ordinal relationship between the categories, which is crucial if the categories are nominal (without a natural order).

Algorithm Compatibility: Linear models and neural networks need numerical input and generally work well with one-hot encoded data, since it lets them treat each category independently. Tree-based models can also use one-hot features, and with only 5 categories the added dimensionality is small.

Interpretability: The encoded features are easy to interpret, as each new binary feature represents the presence or absence of a specific category.

Example:
Assume the dataset has a column "Fruit Type" with 5 unique values: "Apple," "Banana," "Cherry," "Date," and "Elderberry."

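Original Data:

| Fruit Type |
|---|
| Apple |
| Banana |
| Cherry |
| Date |
| Elderberry |

After One-Hot Encoding:

| Apple | Banana | Cherry | Date | Elderberry |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |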

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

To determine the number of new columns created by using nominal (one-hot) encoding for the categorical data in your dataset, follow these steps:

1. Identify the number of unique values in each categorical column.
2. Apply one-hot encoding to each categorical column.

### Example:
The question does not state how many unique values each categorical column has, so let's assume the following:
- Categorical Column 1: 4 unique values (e.g., A, B, C, D)
- Categorical Column 2: 3 unique values (e.g., X, Y, Z)

### Calculations:

- Categorical Column 1: One-hot encoding will create 4 new columns (one for each unique value).
- Categorical Column 2: One-hot encoding will create 3 new columns (one for each unique value).

#### Total New Columns:
$$4 \text{ (from Categorical Column 1)} + 3 \text{ (from Categorical Column 2)} = 7 \text{ new columns}$$

Therefore, after applying nominal (one-hot) encoding to the two categorical columns, the dataset will have:

- Original numerical columns: 3
- New one-hot encoded columns: 7

### Total Columns After Encoding:
$$3 \text{ (numerical columns)} + 7 \text{ (one-hot encoded columns)} = 10 \text{ columns}$$

In summary, using nominal encoding for the two categorical columns will result in 7 new columns being created.
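The same count can be checked in code. This is a minimal sketch using a small hypothetical DataFrame (the column names `cat_1`, `cat_2`, `num_1` to `num_3` and the six sample rows are assumptions); the number of new columns depends only on the number of unique values, not on the 1000 rows.

```python
import pandas as pd

# Hypothetical data matching the assumed cardinalities: 4 and 3 unique values.
df = pd.DataFrame({
    "cat_1": ["A", "B", "C", "D", "A", "B"],
    "cat_2": ["X", "Y", "Z", "X", "Y", "Z"],
    "num_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "num_2": [10, 20, 30, 40, 50, 60],
    "num_3": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
categorical_cols = ["cat_1", "cat_2"]

# New columns = sum of unique values across the categorical columns: 4 + 3.
new_columns = int(df[categorical_cols].nunique().sum())

# Full one-hot encoding: 3 numerical + 7 indicator columns.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

# Dropping one level per column gives (4 - 1) + (3 - 1) = 5 indicator columns.
reduced = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)

print(new_columns, encoded.shape[1], reduced.shape[1])
```

The `drop_first=True` variant is shown because some workflows drop one indicator per feature to avoid redundant columns, in which case the count is 5 rather than 7.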

## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

For a dataset containing categorical information about different types of animals, including their species, habitat, and diet, I would use one-hot encoding to transform the categorical data into a format suitable for machine learning algorithms. Here's why:

Justification:
Nominal Nature of Data: The categories (species, habitat, diet) are nominal, meaning they do not have a natural order. One-hot encoding effectively handles such data by creating binary columns for each unique category, ensuring no ordinal assumptions are made by the model.

Algorithm Compatibility: One-hot encoding is widely supported and works well with most machine learning algorithms, including linear models, neural networks, and tree-based methods. It allows these algorithms to treat each category independently, which can lead to better performance.

Prevents Misleading Relationships: By using one-hot encoding, we avoid introducing misleading relationships between categories. For example, if we used label encoding, the model might incorrectly infer a ranking or distance between categories that do not have any inherent order.

Example:

Original Data:

| Species | Habitat | Diet |
|---|---|---|
| Lion | Savanna | Carnivore |
| Elephant | Forest | Herbivore |
| Penguin | Antarctic | Carnivore |
| Kangaroo | Desert | Herbivore |
| Shark | Ocean | Carnivore |

One-Hot Encoded Data:

| Lion | Elephant | Penguin | Kangaroo | Shark | Savanna | Forest | Antarctic | Desert | Ocean | Carnivore | Herbivore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |

Conclusion:
Using one-hot encoding for the species, habitat, and diet columns ensures that the machine learning model can process the data without making any incorrect assumptions about the relationships between categories, which generally supports a more reliable model.
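The animal example can be reproduced with a short pandas sketch. The variable names are illustrative, and `pd.get_dummies` is one possible tool rather than the only way to do this; it prefixes each new column with the source column name and orders categories alphabetically.

```python
import pandas as pd

# The five animals from the example.
animals = pd.DataFrame({
    "Species": ["Lion", "Elephant", "Penguin", "Kangaroo", "Shark"],
    "Habitat": ["Savanna", "Forest", "Antarctic", "Desert", "Ocean"],
    "Diet": ["Carnivore", "Herbivore", "Carnivore", "Herbivore", "Carnivore"],
})

# Encode all three nominal columns at once.
# Resulting column names look like "Species_Lion", "Habitat_Savanna", "Diet_Carnivore".
animals_encoded = pd.get_dummies(
    animals, columns=["Species", "Habitat", "Diet"], dtype=int
)

# 5 species + 5 habitats + 2 diets = 12 indicator columns.
print(animals_encoded.shape)
print(animals_encoded.columns.tolist())
```

Every row has exactly three "1" values, one for each of the original columns.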

## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

To predict customer churn, the categorical features in the dataset (gender and contract type) must be converted into numerical format before most machine learning algorithms can use them. One-hot encoding is the technique applied here; label encoding is described for comparison but is not needed. Here is a step-by-step explanation of the implementation:

Features:
- Gender (categorical)
- Age (numerical)
- Contract Type (categorical)
- Monthly Charges (numerical)
- Tenure (numerical)

Encoding Techniques:

### 1. One-Hot Encoding
- Use Case: For categorical variables where there is no ordinal relationship, such as gender and contract type.
- Purpose: To create binary columns for each category, allowing the model to treat each category independently.

Step-by-Step Implementation:

a. Gender Encoding:

Original Data: "Male" and "Female"

One-Hot Encoding Result:

| Male | Female |
|---|---|
| 1 | 0 |
| 0 | 1 |
| 1 | 0 |
| 0 | 1 |


b. Contract Type Encoding:

Original Data: "Month-to-Month", "One Year", "Two Year"

One-Hot Encoding Result:

| Month-to-Month | One Year | Two Year |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |


### 2. Label Encoding
Use Case: Label encoding suits ordinal categorical variables, whose categories have a natural order. It is not used here because "gender" has no natural order and "contract type" is treated as nominal (contract types could arguably be ranked by length, but one-hot encoding avoids imposing that assumption on the model).
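For contrast, here is a minimal sketch of label-style encoding on a feature that does have a natural order. The `Satisfaction` column is hypothetical and is not part of the churn dataset. scikit-learn's `OrdinalEncoder` is used because `LabelEncoder` is intended for target labels rather than input features.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (not one of the five churn features).
feedback = pd.DataFrame({"Satisfaction": ["Low", "High", "Medium", "Low"]})

# Pass the category order explicitly; otherwise the categories
# are sorted alphabetically (High, Low, Medium), which loses the ranking.
ordinal_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])

# fit_transform returns a 2-D float array; ravel() flattens it to one column.
feedback["Satisfaction_Encoded"] = ordinal_encoder.fit_transform(
    feedback[["Satisfaction"]]
).ravel()

print(feedback)
```

Here the codes 0, 1, and 2 are meaningful because "Low" < "Medium" < "High" is a real ordering.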

```python
# Implementing one-hot encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample DataFrame
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Contract Type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month'],
    'Age': [30, 25, 40, 35],
    'Monthly Charges': [70, 80, 90, 85],
    'Tenure': [12, 24, 18, 15]
}

df = pd.DataFrame(data)

# One-hot encode the categorical features.
# drop='first' removes the first category of each feature to avoid redundant columns.
# sparse_output=False returns a dense array (this argument was named `sparse`
# before scikit-learn 1.2).
categorical_cols = ['Gender', 'Contract Type']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_features = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_cols))

# Concatenate encoded features with numerical features
final_df = pd.concat([df[['Age', 'Monthly Charges', 'Tenure']], encoded_df], axis=1)

print(final_df)
```

```
   Age  Monthly Charges  Tenure  Gender_Male  Contract Type_One Year  \
0   30               70      12          1.0                     0.0   
1   25               80      24          0.0                     1.0   
2   40               90      18          1.0                     0.0   
3   35               85      15          0.0                     0.0   

   Contract Type_Two Year  
0                     0.0  
1                     0.0  
2                     1.0  
3                     0.0  
```




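Final DataFrame:
Because the encoder drops the first category of each feature (drop='first'), the final DataFrame has the following columns: Age, Monthly Charges, Tenure, Gender_Male, Contract Type_One Year, Contract Type_Two Year. The dropped categories, "Female" and "Month-to-Month", are represented by zeros in the remaining columns of their feature. Without drop='first', the encoding would produce all five indicator columns shown in the step-by-step tables.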

Summary:

- One-Hot Encoding is used for the "Gender" and "Contract Type" features to convert them into a format suitable for machine learning models.
- Numerical features (Age, Monthly Charges, Tenure) are used directly without additional encoding.