In [None]:
"""Q.1
Data encoding, in the context of data science, refers to the process of converting categorical or non-numeric data into a numerical format that can be used for analysis or machine learning tasks. Categorical data represents discrete categories or labels, such as "red," "blue," "green," or "high," "medium," "low," while non-numeric data can include text, dates, or other non-numeric values. Data encoding is useful in data science for the following reasons:
1. Compatibility with Algorithms: Many machine learning algorithms and statistical models require numerical inputs. Encoding prepares the data so that it can be fed into these algorithms.
2. Feature Engineering: Encoding categorical data allows you to create meaningful features from categorical variables. For example, you can one-hot encode a categorical variable to transform it into binary vectors representing the presence or absence of each category.
3. Controlling Dimensionality: The choice of encoding determines how many columns a categorical feature becomes. One-hot encoding adds one column per category, whereas compact schemes such as label encoding keep a high-cardinality feature in a single column. This matters most when you have limited data and a large number of categories.
4. Comparison and Analysis: Once encoded, categorical data can be used in statistical analysis, visualization, and exploration of relationships between variables.
5. Machine Learning Models: Many implementations of algorithms such as decision trees, support vector machines, and neural networks (for example, those in scikit-learn) expect numerical input. Encoding is therefore a crucial preprocessing step when using them for classification, regression, or clustering tasks.
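As a minimal sketch of point 1, the snippet below shows two common ways to turn string categories into numbers with pandas: integer codes and one-hot columns. The `color` column is a hypothetical toy example, not taken from any dataset in this notebook.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
colors = pd.Series(["red", "blue", "green", "blue"], name="color")

# Integer codes: categories are sorted alphabetically (blue=0, green=1, red=2)
codes = colors.astype("category").cat.codes

# One-hot columns: one 0/1 indicator column per category
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

print(codes.tolist())
print(one_hot)
```

Either result is purely numerical and can be passed to an estimator; which form is appropriate depends on whether the categories have an order.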

In [None]:
"""Q.2
Nominal encoding, most commonly implemented as one-hot encoding, is a technique used in data preprocessing to convert categorical data into a numerical format. It is particularly useful for nominal categorical variables, where there is no inherent order among categories. Each category is represented by its own binary column, and each column indicates the presence (1) or absence (0) of that category. (Binary encoding is a different technique, which writes an integer code for each category in base 2 across a smaller number of columns.)
Here's an example of how you would use nominal encoding in a real-world scenario:

In [6]:
import pandas as pd
data=pd.DataFrame({
    'Symptom':['Fever','Cough','Headache','Cough','Fever']})
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()
encoded=encoder.fit_transform(data)
encoded_data=pd.DataFrame(encoded.toarray(),columns=encoder.get_feature_names_out())
pd.concat([data,encoded_data],axis=1)

,Symptom,Symptom_Cough,Symptom_Fever,Symptom_Headache
0,Fever,0.0,1.0,0.0
1,Cough,1.0,0.0,0.0
2,Headache,0.0,0.0,1.0
3,Cough,1.0,0.0,0.0
4,Fever,0.0,1.0,0.0


In [None]:
"""Q.3
Nominal encoding and one-hot encoding are both techniques for converting categorical data into numerical format, but they suit different situations. In this answer, "nominal encoding" refers to label (integer) encoding, which assigns a unique integer to each category, while one-hot encoding creates a binary column for each category. Here are situations in which nominal encoding may be preferred over one-hot encoding, each with a practical example:
1. Ordinal Categorical Data:
Scenario: When dealing with ordinal categorical variables, where there is a clear and meaningful order among categories, integer labels can be a better choice, provided the integers are assigned in that order (this is often called ordinal encoding).
Example: Education levels ("High School," "Bachelor's," "Master's," "Ph.D.") can be encoded as 1, 2, 3, 4, respectively, because they have a natural order.
2. When Reducing Dimensionality is Critical:
Scenario: In cases where keeping the number of features small is crucial due to computational limitations or the curse of dimensionality, nominal encoding can be preferred.
Example: Consider a dataset with a large number of nominal categories, such as postal codes for a country. One-hot encoding would produce one binary column per postal code, leading to very high-dimensional data. Nominal encoding keeps the feature in a single column, at the cost of implying an arbitrary order among the codes, so it works best with models that are not sensitive to that order.
3. Preserving Ordinal Information:
Scenario: If the categorical variable has ordinality that is essential for the analysis or modeling task, nominal encoding preserves that ordinal information.
Example: Temperature categories ("Low," "Medium," "High") can be encoded as 1, 2, 3 using nominal encoding, allowing models to capture the ordinal relationship.
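4. Avoiding Collinearity:
Scenario: A full set of one-hot columns always sums to 1 in every row, so any one column is completely determined by the others (the "dummy variable trap"). This multicollinearity can be problematic for models such as unregularized linear regression with an intercept. A single integer column avoids it, although dropping one of the one-hot columns is the more common remedy.
Example: If you have a categorical variable "Season" with categories "Spring," "Summer," "Fall," and "Winter," one-hot encoding creates four columns in which "Winter" is 1 exactly when the other three are all 0, so the fourth column adds no new information.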
5. Space and Memory Efficiency:
Scenario: When working with very large datasets, one-hot encoding can significantly increase memory and storage requirements. Nominal encoding is more memory-efficient.
Example: In a large-scale e-commerce platform with millions of products, encoding product categories using nominal encoding can be more practical than creating a separate binary column for each product category.
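The trade-off in points 2 and 5 can be sketched as follows. The `postal_code` column and its values are hypothetical and invented for illustration. Integer codes keep a single column, while one-hot encoding adds one column per distinct category; note that the integer codes carry an arbitrary order, which is the price of the compact representation.

```python
import pandas as pd

# Hypothetical high-cardinality column (values invented for illustration)
df = pd.DataFrame(
    {"postal_code": ["10001", "94105", "60601", "10001", "73301", "94105"]}
)

# Label-style encoding: one integer column, codes follow sorted category order
label_encoded = df["postal_code"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct postal code
one_hot = pd.get_dummies(df["postal_code"], prefix="postal_code", dtype=int)

print("distinct categories:", df["postal_code"].nunique())
print("label-encoded columns:", 1)
print("one-hot columns:", one_hot.shape[1])
```

With thousands of distinct postal codes the one-hot frame would grow to thousands of columns, whereas the label-encoded series stays a single column.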

In [None]:
"""Q.4
When you have a categorical variable with five unique values, you have two primary encoding techniques to consider: nominal encoding (label encoding) and one-hot encoding. The choice between these techniques depends on the specific characteristics of the categorical variable and your modeling objectives. Here's an explanation for each choice:
Nominal Encoding (Label Encoding):
When to Use: Nominal encoding is suitable when the categorical variable represents ordinal data, meaning there is a meaningful and logical order among the categories. In such cases, you can assign integer labels to the categories based on their natural order.
Why Use It: Nominal encoding is preferred when you want to retain the ordinal information in the variable. It reduces dimensionality compared to one-hot encoding and can be more memory-efficient, which can be advantageous when dealing with large datasets.
Example: Suppose you have a categorical variable representing education levels with categories "High School," "Associate's," "Bachelor's," "Master's," and "Ph.D." You could assign the integer labels 0, 1, 2, 3, 4 in that order, so that a higher label always means a higher level of education.

In [21]:
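import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data=pd.DataFrame({
    'Experience(years)':[4,6,5,3,2],
    'Education':[ "High School", "Associate's", "Bachelor's", "Master's", "Ph.D."]})
# LabelEncoder assigns codes in alphabetical order (Associate's=0, Bachelor's=1, High School=2),
# which would not match the education ranking. OrdinalEncoder is used here instead,
# with the intended order passed explicitly.
education_order=["High School", "Associate's", "Bachelor's", "Master's", "Ph.D."]
encoder=OrdinalEncoder(categories=[education_order])
data['Encoded Education'] = encoder.fit_transform(data[['Education']])[:, 0].astype(int)
print(data)  # Display the modified DataFrame

   Experience(years)    Education  Encoded Education
0                  4  High School                  0
1                  6  Associate's                  1
2                  5   Bachelor's                  2
3                  3     Master's                  3
4                  2        Ph.D.                  4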


In [None]:
"""One-Hot Encoding:
When to Use: One-hot encoding is the preferred choice when dealing with nominal categorical data where there is no inherent order or hierarchy among categories. Each category is transformed into a binary column, making it suitable for machine learning algorithms that require numerical input features.
Why Use It: One-hot encoding treats every category as equally distinct from every other. It avoids introducing any artificial order or distance between categories, which is especially important for models that interpret feature values numerically, such as linear models, neural networks, and distance-based clustering algorithms.
Example: If you have a categorical variable representing car colors with categories "Red," "Blue," "Green," "Yellow," and "Black," one-hot encoding would create five binary columns, each representing the presence or absence of a specific color for each data point.

In [62]:
import pandas as pd 
data = pd.DataFrame({
    'CarID': [1, 2, 3, 4, 5],
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Black']
})
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older releases used sparse=False
encoded_color=encoder.fit_transform(data[['Color']])
encoded_data=pd.DataFrame(encoded_color,columns=encoder.get_feature_names_out(['Color']))
pd.concat([data,encoded_data],axis=1)



,CarID,Color,Color_Black,Color_Blue,Color_Green,Color_Red,Color_Yellow
0,1,Red,0.0,0.0,0.0,1.0,0.0
1,2,Blue,0.0,1.0,0.0,0.0,0.0
2,3,Green,0.0,0.0,1.0,0.0,0.0
3,4,Yellow,0.0,0.0,0.0,0.0,1.0
4,5,Black,1.0,0.0,0.0,0.0,0.0


In [None]:
"""Q.5
Consider a dataset with 1000 rows and 5 columns, of which two are categorical and three are numerical.
The number of new columns created by nominal (one-hot) encoding depends only on the number of unique categories in each of the two categorical columns, not on the number of rows. The category counts below are assumed for illustration:
unique_categories_column1 = 4
unique_categories_column2 = 3
total_new_columns = unique_categories_column1 + unique_categories_column2
                  = 4 + 3 = 7
So, under this assumption, nominal encoding creates a total of 7 new columns. If the two original categorical columns are dropped after encoding, the transformed dataset has 3 + 7 = 10 columns.
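The count can be checked on a small stand-in dataset. The frame below is hypothetical: its column names and values are invented, with 4 and 3 distinct categories to match the assumption above. It uses `pd.get_dummies`, which performs one-hot encoding and replaces the original categorical columns.

```python
import pandas as pd

# Hypothetical dataset: 3 numerical and 2 categorical columns, 1000 rows
n_rows = 1000
regions = ["north", "south", "east", "west"]   # 4 categories
levels = ["low", "medium", "high"]             # 3 categories
df = pd.DataFrame({
    "num_a": range(n_rows),
    "num_b": [i * 0.5 for i in range(n_rows)],
    "num_c": [i % 7 for i in range(n_rows)],
    "cat_1": [regions[i % 4] for i in range(n_rows)],
    "cat_2": [levels[i % 3] for i in range(n_rows)],
})

categorical_cols = ["cat_1", "cat_2"]

# New indicator columns = sum of the unique category counts
new_columns = sum(df[col].nunique() for col in categorical_cols)

# One-hot encode; the two original categorical columns are dropped
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(new_columns)       # 4 + 3 = 7 indicator columns
print(encoded.shape[1])  # 3 numerical + 7 indicator = 10 columns
```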

In [None]:
"""Q.6
Reason: The "Diet" column is likely nominal categorical data because different animals may have various dietary preferences, and there is no inherent order among diet categories.
Justification: One-hot encoding is suitable for the "Diet" column, similar to the "Species" column. Each unique diet category is transformed into a binary column, ensuring that no artificial order is introduced.

In [58]:
import pandas as pd
data = pd.DataFrame({
    'AnimalID': [1, 2, 3, 4, 5, 6],
    'Species': ['Lion', 'Tiger', 'Panda', 'Giraffe', 'Elephant', 'Racoon'],
    'Habitat': ['Grassland', 'Jungle', 'Bamboo Forest', 'Savannah', 'Savannah', 'Marshes'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Herbivore', 'Herbivore', 'Omnivore']
})
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older releases used sparse=False
encoded=encoder.fit_transform(data[['Diet']])
encoded_data=pd.DataFrame(encoded,columns=encoder.get_feature_names_out(['Diet']))
data1=pd.concat([data,encoded_data],axis=1)
data1



,AnimalID,Species,Habitat,Diet,Diet_Carnivore,Diet_Herbivore,Diet_Omnivore
0,1,Lion,Grassland,Carnivore,1.0,0.0,0.0
1,2,Tiger,Jungle,Carnivore,1.0,0.0,0.0
2,3,Panda,Bamboo Forest,Herbivore,0.0,1.0,0.0
3,4,Giraffe,Savannah,Herbivore,0.0,1.0,0.0
4,5,Elephant,Savannah,Herbivore,0.0,1.0,0.0
5,6,Racoon,Marshes,Omnivore,0.0,0.0,1.0
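The cell above encodes only "Diet". If "Species" and "Habitat" are treated as nominal as well (an assumption for "Habitat", which the answer does not discuss), all three columns can be one-hot encoded in one call. The sketch below uses `pd.get_dummies` as an alternative to `OneHotEncoder`; it is not the method used in the cell above.

```python
import pandas as pd

data = pd.DataFrame({
    'AnimalID': [1, 2, 3, 4, 5, 6],
    'Species': ['Lion', 'Tiger', 'Panda', 'Giraffe', 'Elephant', 'Racoon'],
    'Habitat': ['Grassland', 'Jungle', 'Bamboo Forest', 'Savannah', 'Savannah', 'Marshes'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Herbivore', 'Herbivore', 'Omnivore']
})

# One-hot encode every nominal column; the original text columns are replaced
encoded_all = pd.get_dummies(data, columns=['Species', 'Habitat', 'Diet'], dtype=int)

# 6 rows; 1 ID column + 6 species + 5 habitats + 3 diets = 15 columns
print(encoded_all.shape)
print(encoded_all.filter(like='Diet_'))
```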


In [69]:
#Q.7
import pandas as pd
# Create a DataFrame with customer information
data=pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Gender':['Male','Female','Male','Female','Female'],
    'Age':[25,23,35,30,29],
    'Contract type':['Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly'],
    'MonthlyCharges': [50.0, 65.5, 45.0, 85.0, 55.0],
    'Tenure': [12, 24, 6, 36, 8]})
data

,CustomerID,Gender,Age,Contract type,MonthlyCharges,Tenure
0,1,Male,25,Monthly,50.0,12
1,2,Female,23,Annual,65.5,24
2,3,Male,35,Monthly,45.0,6
3,4,Female,30,Annual,85.0,36
4,5,Female,29,Monthly,55.0,8


In [71]:
# Transform the categorical columns into numerical codes
# Step 1: Import LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder
# Step 2: Create one encoder per column so that each keeps its own fitted mapping
gender_encoder=LabelEncoder()
contract_encoder=LabelEncoder()
# Step 3: Apply label encoding to the 'Gender' column
data['encoded_gender']=gender_encoder.fit_transform(data['Gender'])
# Step 4: Apply label encoding to the 'Contract type' column
data['encoded_contract_type']=contract_encoder.fit_transform(data['Contract type'])
print(data)  # Step 5: Display the transformed dataset

   CustomerID  Gender  Age Contract type  MonthlyCharges  Tenure  \
0           1    Male   25       Monthly            50.0      12   
1           2  Female   23        Annual            65.5      24   
2           3    Male   35       Monthly            45.0       6   
3           4  Female   30        Annual            85.0      36   
4           5  Female   29       Monthly            55.0       8   

   encoded_gender  encoded_contract_type  
0               1                      1  
1               0                      0  
2               1                      1  
3               0                      0  
4               0                      1  


In [None]:
# LabelEncoder assigns codes in alphabetical order of the labels, so:
# 'Female' is encoded as 0 and 'Male' as 1.
# 'Annual' contracts are encoded as 0 and 'Monthly' contracts as 1.
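The mapping noted above does not have to be read off the output by eye. A fitted `LabelEncoder` stores its sorted labels in `classes_`, and the position of each label in that array is its code. The sketch below uses the same 'Gender' and 'Contract type' values as the cells above; `code_mapping` is a hypothetical helper written for this example, not part of scikit-learn.

```python
from sklearn.preprocessing import LabelEncoder

gender = ['Male', 'Female', 'Male', 'Female', 'Female']
contract = ['Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly']

def code_mapping(values):
    """Fit a LabelEncoder and return a {label: code} dictionary (hypothetical helper)."""
    encoder = LabelEncoder().fit(values)
    # classes_ is sorted; each label's index in it is its integer code
    return {str(label): code for code, label in enumerate(encoder.classes_)}

gender_map = code_mapping(gender)
contract_map = code_mapping(contract)

print(gender_map)
print(contract_map)
```

Keeping this dictionary alongside the encoded data makes it possible to translate model inputs or outputs back into the original labels later.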