### Q1. What is data encoding? How is it useful in data science?

Ans) **Data encoding** is the process of converting data from one representation or format to another. In data science, it usually means converting categorical or textual data into numerical form, which most machine learning algorithms and statistical methods require.

Data encoding is useful for several reasons:

**Numerical Representation:** Many machine learning algorithms and statistical models require numerical inputs. Data encoding enables the conversion of categorical features, such as gender (e.g., "male" and "female") or city names, into numerical values (e.g., 0 and 1, or 1, 2, 3, etc.) that can be processed by these algorithms.

**Data Standardization:** Data encoding helps to standardize data across different sources or systems. By converting data to a common numerical format, data scientists can work with consistent representations of categorical variables.

**Efficient Computation:** Numerical data is easier and faster to compute with than text or categorical data. Machine learning algorithms rely on extensive mathematical operations (for example, matrix multiplication), and these operate directly on numbers rather than on strings.

**Feature Engineering:** Data encoding is an essential part of feature engineering, where data scientists transform raw data into meaningful features that can enhance model performance and predictive accuracy.

**Handling Missing Values:** Data encoding can also help in handling missing values. For example, if a categorical feature contains missing values, data encoding can assign a specific value to represent those missing entries.
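
As a minimal sketch of the missing-value point, one option is to give missing entries their own category before encoding. The column name `color` and the placeholder label `"Missing"` are assumptions chosen for illustration, not part of any dataset discussed here.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
df = pd.DataFrame({"color": ["Red", None, "Blue", "Red"]})

# Give missing entries an explicit category so the encoder
# treats them like any other value
df["color"] = df["color"].fillna("Missing")

# One-hot encode; dtype=int yields 0/1 indicators rather than booleans
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

Whether a separate "Missing" category is appropriate depends on why the values are missing; imputing the most frequent category is another common choice.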

##### Common methods of data encoding in data science include:

1. Label Encoding: Assigning a unique integer to each category in a categorical variable.
2. One-Hot Encoding: Creating a binary column for each category, indicating the presence (1) or absence (0) of that category.
3. Ordinal Encoding: Assigning integers to categories based on a predefined order or ranking.
4. Binary Encoding: Assigning an integer to each category, then writing that integer in binary and storing each binary digit in its own column.
5. Hash Encoding: Using a hash function to map categories to a fixed number of numerical columns.
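
Binary and hash encoding are the least intuitive of these methods, so here is a rough sketch of both using only the standard library. The helper names `binary_encode` and `hash_encode`, the category list, and the bucket count of 8 are all assumptions made for illustration. MD5 is used only because it gives the same result on every run, whereas Python's built-in `hash()` is salted per process for strings.

```python
import hashlib

categories = ["Diesel", "Electric", "Hybrid", "Petrol"]

# Binary encoding: label-encode first, then spread the bits of each
# integer across separate columns
labels = {cat: i for i, cat in enumerate(categories)}
n_bits = max(1, (len(categories) - 1).bit_length())  # 4 categories -> 2 bits


def binary_encode(cat):
    """Return the category's integer label as a list of bits (most significant first)."""
    code = labels[cat]
    return [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def hash_encode(cat, n_buckets=8):
    """Map a category to a bucket index in [0, n_buckets) using a stable hash."""
    digest = hashlib.md5(cat.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


for cat in categories:
    print(cat, binary_encode(cat), hash_encode(cat))
```

Binary encoding needs only about log2(k) columns for k categories. Hash encoding keeps the column count fixed at `n_buckets`, at the cost of possible collisions between categories.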

### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Ans) Nominal encoding, commonly implemented as one-hot encoding, represents categorical data that has no inherent order as numerical data suitable for machine learning algorithms. Each category is represented as a binary vector in which exactly one position, the one corresponding to that category, is 1 and all others are 0.

In [10]:
# For example, suppose we have a dataset that contains Car Model and Fuel Type:
import pandas as pd

data = {"Car Model": ['Volkswagen Vento', 'Mercedes GLA 200', 'Tata Nexon', 'BMW X5', 'Maruti Grand Vitara', 'XUV500'],
        "Fuel Type": ['Diesel', 'Petrol', 'Electric', 'Petrol', 'Hybrid', 'Diesel']}

df = pd.DataFrame(data)
df

Unnamed: 0,Car Model,Fuel Type
0,Volkswagen Vento,Diesel
1,Mercedes GLA 200,Petrol
2,Tata Nexon,Electric
3,BMW X5,Petrol
4,Maruti Grand Vitara,Hybrid
5,XUV500,Diesel


In [11]:
# Perform nominal encoding (one-hot encoding) on the 'Fuel Type' column
# dtype=int keeps the indicator columns as 0/1 instead of booleans
encoded_df = pd.get_dummies(df, columns=['Fuel Type'], dtype=int)

# Display the encoded DataFrame
encoded_df

Unnamed: 0,Car Model,Fuel Type_Diesel,Fuel Type_Electric,Fuel Type_Hybrid,Fuel Type_Petrol
0,Volkswagen Vento,1,0,0,0
1,Mercedes GLA 200,0,0,0,1
2,Tata Nexon,0,1,0,0
3,BMW X5,0,0,0,1
4,Maruti Grand Vitara,0,0,1,0
5,XUV500,1,0,0,0


In [12]:
# Place the original 'Fuel Type' column next to its encoded columns for comparison
pd.concat([df, encoded_df.drop(columns=['Car Model'])], axis=1)

Unnamed: 0,Car Model,Fuel Type,Fuel Type_Diesel,Fuel Type_Electric,Fuel Type_Hybrid,Fuel Type_Petrol
0,Volkswagen Vento,Diesel,1,0,0,0
1,Mercedes GLA 200,Petrol,0,0,0,1
2,Tata Nexon,Electric,0,1,0,0
3,BMW X5,Petrol,0,0,0,1
4,Maruti Grand Vitara,Hybrid,0,0,1,0
5,XUV500,Diesel,1,0,0,0


### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

In this answer, nominal encoding refers to label encoding, which assigns a unique integer (0, 1, 2, ...) to each category. The feature stays in a single dimension because every category maps to one number in one column.

One-hot encoding, on the other hand, creates a binary column for each category, indicating the presence (1) or absence (0) of that category. Dimensionality grows with the number of categories: a feature with k categories becomes k columns.

Nominal/label encoding is preferred when the extra dimensions would make the model too complex (for example, with a high-cardinality feature) and the model is not misled by the integer codes. One-hot encoding is preferred when the model expects binary indicator inputs or would otherwise read an order into the codes.

In [19]:
# Suppose you're building a decision tree model to classify customer satisfaction based on the
# type of payment method (a nominal feature with no order: 'Credit Card', 'PayPal', 'Bank Transfer').

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = {'Customer': ['Aman', 'Robert', 'Harry', 'David'],
        'Payment_Method': ['Credit Card', 'PayPal', 'Bank Transfer', 'Credit Card']}

df = pd.DataFrame(data)

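# Apply nominal encoding (label encoding); integers follow the sorted order of the category names
# Note: scikit-learn intends LabelEncoder for target labels; OrdinalEncoder is its counterpart for input features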
le = LabelEncoder()
df['Payment_Encoded'] = le.fit_transform(df['Payment_Method'])

print(df)

  Customer Payment_Method  Payment_Encoded
0     Aman    Credit Card                1
1   Robert         PayPal                2
2    Harry  Bank Transfer                0
3    David    Credit Card                1


When not to use nominal encoding:
If you use a linear model (e.g., logistic regression), the algorithm treats the integer labels as magnitudes and assumes an ordinal relationship between them (0 < 1 < 2), which is incorrect for nominal features. In such cases, one-hot encoding is preferred.
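
To illustrate the false ordering, the sketch below compares distances between payment methods under the two encodings. It uses only the standard library; the `distance` helper is an assumption introduced for this illustration, and the integer codes follow the alphabetical order that `LabelEncoder` produced above.

```python
import math

categories = ["Bank Transfer", "Credit Card", "PayPal"]

# Label encoding: one integer per category (alphabetical order)
label_codes = {cat: [float(i)] for i, cat in enumerate(categories)}

# One-hot encoding: one binary column per category
one_hot = {
    cat: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, cat in enumerate(categories)
}


def distance(a, b):
    """Euclidean distance between two encoded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


for name, enc in [("label", label_codes), ("one-hot", one_hot)]:
    d1 = distance(enc["Bank Transfer"], enc["Credit Card"])
    d2 = distance(enc["Bank Transfer"], enc["PayPal"])
    print(f"{name}: Bank Transfer vs Credit Card = {d1:.3f}, "
          f"Bank Transfer vs PayPal = {d2:.3f}")
```

With label codes, PayPal appears twice as far from Bank Transfer as Credit Card does, purely because of alphabetical order. With one-hot vectors, every pair of categories is equally far apart.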

### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

If the dataset contains categorical data with 5 unique values, and there is no inherent order or ranking among the categories, I would prefer to use one-hot encoding to transform the data into a format suitable for machine learning algorithms.

Explanation for choosing one-hot encoding:

1. Nominal Data: Since the data has 5 unique values with no inherent order, it is nominal data. One-hot encoding is the most appropriate technique for nominal data because it converts each category into a binary vector representation, effectively removing any numerical relationship between the categories.

2. Avoiding Ordinality: One-hot encoding ensures that no ordinality is imposed among the categories. Each category is represented as a binary vector with a single '1' and the rest '0's, which prevents the model from interpreting any magnitude or rank relationship among the categories.

3. Model Interpretability: One-hot encoding provides better interpretability for the model's predictions. The encoded binary vectors directly represent the presence or absence of each category, making it easier to understand the impact of each category on the model's output.

4. Sparse Representation: One-hot vectors are sparse, since only one element in each vector is '1'. When stored in a sparse matrix format, this keeps memory usage and computation low. With only 5 categories, even a dense representation adds just 5 columns, so the cost is small.

5. Compatibility with Algorithms: Many machine learning algorithm implementations, such as logistic regression, decision trees, and support vector machines, expect numerical inputs. One-hot encoding converts categorical data into numerical form, making it compatible with a wide range of machine learning algorithms.
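
A small sketch of this choice, assuming a hypothetical `city` column with exactly 5 unique values (the column name and values are made up for illustration):

```python
import pandas as pd

# Hypothetical nominal feature with exactly 5 unique values
df = pd.DataFrame(
    {"city": ["Delhi", "Mumbai", "Pune", "Chennai", "Kolkata", "Pune", "Delhi"]}
)

# Full one-hot encoding: one 0/1 column per category
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded.shape)                  # rows, one-hot columns
print(encoded.sum(axis=1).tolist())   # each row has exactly one 1

# Optional variant: drop the first category to avoid perfectly
# collinear columns in linear models
encoded_reduced = pd.get_dummies(df, columns=["city"], dtype=int, drop_first=True)
print(encoded_reduced.shape)
```

The `drop_first=True` variant produces one column fewer; the dropped category is then represented by all zeros.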

### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

When using nominal (one-hot) encoding to transform categorical data, the number of new columns created equals the total number of unique categories across the original categorical columns, because each unique category gets its own binary column. The question does not state how many categories each column has, so the counts below are assumptions.

Let's assume the two categorical columns have the following number of unique categories:


Categorical Column 1: 4 unique categories  
Categorical Column 2: 5 unique categories  

To calculate the total number of new columns created:


Total New Columns = Unique Categories in Categorical Column 1 + Unique Categories in Categorical Column 2  
= 4 + 5  
= 9

Therefore, when using nominal encoding to transform the two categorical columns, a total of 9 new columns will be created. Each row in the dataset will have 9 binary columns, representing the one-hot encoded values for the 4 unique categories in Categorical Column 1 and the 5 unique categories in Categorical Column 2. The three numerical columns remain unchanged in the transformed dataset.


Total number of columns in the transformed dataset = 9 one-hot columns + 3 numerical columns
= 12
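
The calculation can be checked with a quick sketch. The column names and category values are placeholders; only the counts (4 and 5 unique categories, 3 numerical columns, 1000 rows) come from the assumptions above.

```python
import pandas as pd

n_rows = 1000
cat1_values = ["A", "B", "C", "D"]        # 4 unique categories (assumed)
cat2_values = ["V", "W", "X", "Y", "Z"]   # 5 unique categories (assumed)

df = pd.DataFrame({
    "cat1": [cat1_values[i % 4] for i in range(n_rows)],
    "cat2": [cat2_values[i % 5] for i in range(n_rows)],
    "num1": list(range(n_rows)),
    "num2": [i * 0.5 for i in range(n_rows)],
    "num3": [i % 7 for i in range(n_rows)],
})

# One-hot encode only the two categorical columns
encoded = pd.get_dummies(df, columns=["cat1", "cat2"], dtype=int)
new_columns = [c for c in encoded.columns if c.startswith(("cat1_", "cat2_"))]

print(len(new_columns))   # one-hot columns created
print(encoded.shape)      # rows, total columns after encoding
```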

### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

#### One-Hot Encoding

Justification:

The categories like species, habitat, and diet are all nominal, meaning they have no natural order (e.g., “Carnivore” isn’t more or less than “Herbivore”).

One-hot encoding creates binary columns for each category, avoiding any false assumption of ordinal relationships.

This is especially helpful for linear models, logistic regression, k-nearest neighbors, and neural networks, which can be misled by numeric labels (like 0, 1, 2) from label encoding.

In [30]:
import pandas as pd

data = {
    'Species': ['Lion', 'Elephant', 'Shark'],
    'Habitat': ['Savannah', 'Forest', 'Ocean'],
    'Diet': ['Carnivore', 'Herbivore', 'Carnivore']
}

df = pd.DataFrame(data)

# One-hot encoding the nominal features
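# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to booleans)
encoded_df = pd.get_dummies(df, dtype=int)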

print(encoded_df)


   Species_Elephant  Species_Lion  Species_Shark  Habitat_Forest  \
0                 0             1              0               0   
1                 1             0              0               1   
2                 0             0              1               0   

   Habitat_Ocean  Habitat_Savannah  Diet_Carnivore  Diet_Herbivore  
0              0                 1               1               0  
1              0                 0               0               1  
2              1                 0               1               0  


### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

In [31]:
import pandas as pd

dff= pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'age': [25, 35, 28, 42, 30],
    'contract type': ['month-to-month', 'one year', 'two years', 'month-to-month', 'one year'],
    'monthly charges': [50.0, 65.0, 80.0, 55.0, 75.0],
    'tenure': [10, 20, 15, 5, 12]
})

dff

Unnamed: 0,gender,age,contract type,monthly charges,tenure
0,Male,25,month-to-month,50.0,10
1,Female,35,one year,65.0,20
2,Male,28,two years,80.0,15
3,Female,42,month-to-month,55.0,5
4,Male,30,one year,75.0,12


In [37]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

dff= pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'age': [25, 35, 28, 42, 30],
    'contract type': ['month-to-month', 'one year', 'two years', 'month-to-month', 'one year'],
    'monthly charges': [50.0, 65.0, 80.0, 55.0, 75.0],
    'tenure': [10, 20, 15, 5, 12]
})

# One-hot encoding for gender (nominal, no inherent order)
encoder = OneHotEncoder()
val = encoder.fit_transform(dff[['gender']]).toarray()

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
encode_df = pd.DataFrame(val, columns=encoder.get_feature_names_out(['gender']))
dff = pd.concat([dff, encode_df], axis=1)

print(dff)

   gender  age   contract type  monthly charges  tenure  gender_Female  \
0    Male   25  month-to-month             50.0      10            0.0   
1  Female   35        one year             65.0      20            1.0   
2    Male   28       two years             80.0      15            0.0   
3  Female   42  month-to-month             55.0       5            1.0   
4    Male   30        one year             75.0      12            0.0   

   gender_Male  
0          1.0  
1          0.0  
2          1.0  
3          0.0  
4          1.0  


In [38]:
# Ordinal encoding for contract type (month-to-month < one year < two years)
from sklearn.preprocessing import OrdinalEncoder

ordinal = OrdinalEncoder(categories=[["month-to-month","one year","two years"]])
# fit_transform returns a 2-D array; ravel() flattens it into a single column
dff['contract_type_ranking'] = ordinal.fit_transform(dff[['contract type']]).ravel()
dff

Unnamed: 0,gender,age,contract type,monthly charges,tenure,gender_Female,gender_Male,contract_type_ranking
0,Male,25,month-to-month,50.0,10,0.0,1.0,0.0
1,Female,35,one year,65.0,20,1.0,0.0,1.0
2,Male,28,two years,80.0,15,0.0,1.0,2.0
3,Female,42,month-to-month,55.0,5,1.0,0.0,0.0
4,Male,30,one year,75.0,12,0.0,1.0,1.0
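
The cells above can be read as these steps: (1) identify the categorical columns (`gender` and `contract type`); (2) one-hot encode `gender`, which is nominal; (3) ordinal-encode `contract type`, which has a natural order by contract length; (4) leave `age`, `monthly charges`, and `tenure` as they are. As a rough sketch, the same steps can be combined into a single scikit-learn `ColumnTransformer`. The variable names `churn` and `preprocess` are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

churn = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "age": [25, 35, 28, 42, 30],
    "contract type": ["month-to-month", "one year", "two years",
                      "month-to-month", "one year"],
    "monthly charges": [50.0, 65.0, 80.0, 55.0, 75.0],
    "tenure": [10, 20, 15, 5, 12],
})

preprocess = ColumnTransformer(
    transformers=[
        # Nominal feature: one binary column per category
        ("gender", OneHotEncoder(), ["gender"]),
        # Ordered feature: integers that follow contract length
        ("contract",
         OrdinalEncoder(categories=[["month-to-month", "one year", "two years"]]),
         ["contract type"]),
    ],
    remainder="passthrough",   # keep the numerical columns unchanged
    sparse_threshold=0,        # always return a dense array
)

X = preprocess.fit_transform(churn)
print(X.shape)
print(X[0])  # gender_Female, gender_Male, contract rank, age, monthly charges, tenure
```

Keeping the encoders inside one transformer means the same fitted mapping can later be applied to new customers with `preprocess.transform(...)`.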
