Q1: What is data encoding? How is it useful in data science?

Data Encoding:
Data encoding is the process of transforming categorical data into a numerical format so that machine learning
algorithms can process it.
Categorical data is qualitative data that falls into discrete categories. It can be nominal (no inherent order, such as color) or ordinal (a meaningful order, such as low, medium, high).

Usefulness in Data Science:
Machine Learning Compatibility: Many machine learning algorithms require numerical input,
so encoding is essential for using these algorithms on categorical data.
Improved Model Performance: An encoding that matches the type of the variable (nominal or ordinal) keeps the model from learning relationships that are not in the data, which can improve performance.
Data Preprocessing: Encoding is a crucial step in data preprocessing, ensuring that all data is in a
usable format for analysis and model training.
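
As a minimal illustration of the idea, the sketch below turns a text column into integer codes with pandas. The column name `Size` and its values are made up for the example and are not from any real dataset.

```python
import pandas as pd

# Hypothetical example column; the name and values are illustrative only
df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# Convert to pandas' categorical dtype; string categories are sorted alphabetically
size_as_category = df['Size'].astype('category')

# .cat.codes gives each value's integer position in the sorted category list
df['Size Code'] = size_as_category.cat.codes

print(list(size_as_category.cat.categories))
print(df)
```

Note that the codes follow alphabetical order (Large, Medium, Small), not size order, which is one reason the choice of encoding matters.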

Q2: What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal Encoding:
Nominal encoding, treated here as label encoding, assigns a unique integer to each category in the data.
The integers are arbitrary identifiers, so the method fits categorical data with no ordinal relationship between the categories.
Be aware that a model that treats its inputs as magnitudes (linear regression, for example) may read an order into the codes that is not there.
Example:
Consider a dataset containing information about fruits with a feature "Fruit Type" which includes categories like "Apple," "Banana," and "Cherry."
Nominal encoding would transform these categories into numerical values.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data
data = {'Fruit Type': ['Apple', 'Banana', 'Cherry', 'Apple', 'Cherry']}
df = pd.DataFrame(data)

# Apply nominal encoding
label_encoder = LabelEncoder()
df['Fruit Type Encoded'] = label_encoder.fit_transform(df['Fruit Type'])

print(df)
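
To see which integer was assigned to which category, the fitted encoder can be inspected. The sketch below repeats the fruit data as a standalone Series (the variable names are chosen for this example) and uses the `classes_` attribute and `inverse_transform` method of `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruits = pd.Series(['Apple', 'Banana', 'Cherry', 'Apple', 'Cherry'])

encoder = LabelEncoder()
codes = encoder.fit_transform(fruits)

# classes_ holds the categories in sorted order; the position is the assigned code
mapping = {str(category): int(code) for code, category in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```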


Q3: In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Preferred Situations:
Nominal encoding is preferred when:
The categorical variable has many unique categories, so one-hot encoding would produce a high-dimensional dataset.
There is no inherent ordinal relationship between the categories, and the model does not treat the integer codes as magnitudes (tree-based models, for example).
Practical Example:
Consider a dataset with a feature "City" having hundreds of unique values.
One-hot encoding would add one column per city, giving a very sparse, high-dimensional dataset.
Nominal encoding keeps the feature in a single integer column, which is more memory-efficient.


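# Example data (five cities shown for brevity; a real column could hold hundreds)
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df = pd.DataFrame(data)

# Apply nominal encoding: each city becomes one integer in a single column
label_encoder = LabelEncoder()
df['City Encoded'] = label_encoder.fit_transform(df['City'])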

print(df)
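
The difference in dimensionality can be made concrete with a rough sketch. The 300 city names below are synthetic placeholders (`City_0`, `City_1`, ...) invented for this comparison.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical high-cardinality column: 300 made-up city names, 3 rows each
cities = [f'City_{i}' for i in range(300)] * 3
df = pd.DataFrame({'City': cities})

# One-hot encoding: one indicator column per unique city
one_hot = pd.get_dummies(df['City'])

# Label encoding: a single column of integer codes
label_codes = LabelEncoder().fit_transform(df['City'])

print("One-hot shape:", one_hot.shape)
print("Label-encoded shape:", label_codes.shape)
```

With 300 unique values, one-hot encoding produces 300 columns while label encoding stays at one.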


Q4: Suppose you have a dataset containing categorical data with 5 unique values. 
Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? 
Explain why you made this choice.

Choice:
If the dataset has only 5 unique categorical values, one-hot encoding is often the best choice.
Reason:
Low Cardinality: With only 5 unique values, one-hot encoding adds at most 5 columns, so the dimensionality stays manageable.
No Ordinal Relationship: One-hot encoding is effective when there is no ordinal relationship among the categories.

from sklearn.preprocessing import OneHotEncoder

data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)

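# Apply one-hot encoding and return a dense array
# (scikit-learn >= 1.2 uses sparse_output; older releases used sparse=False)
one_hot_encoder = OneHotEncoder(sparse_output=False)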
encoded_data = one_hot_encoder.fit_transform(df[['Category']])

encoded_df = pd.DataFrame(encoded_data, columns=one_hot_encoder.get_feature_names_out())
print(encoded_df)


Q5: Nominal Encoding Calculation
Dataset Details:
1000 rows and 5 columns
2 categorical columns and 3 numerical columns
Calculation:
When nominal encoding is applied as one-hot encoding (as pd.get_dummies does), each unique category in the 2 categorical columns is converted into a new column.
The number of new columns therefore depends on the number of unique values, which the question does not state.
Suppose the first categorical column has 4 unique values and the second one has 3 unique values.
Number of New Columns Created:
First categorical column: 4 new columns
Second categorical column: 3 new columns
Total new columns = 4 + 3 = 7

# Example data: Category1 has 4 unique values and Category2 has 3
data = {
    'Category1': ['A', 'B', 'C', 'D', 'A', 'B', 'C'],
    'Category2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'Z'],
    'Numerical1': [1, 2, 3, 4, 5, 6, 7],
    'Numerical2': [7, 6, 5, 4, 3, 2, 1],
    'Numerical3': [1, 1, 1, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

# Apply nominal (one-hot) encoding to the two categorical columns
encoded_df = pd.get_dummies(df[['Category1', 'Category2']], drop_first=False)
print(encoded_df)
print("Number of new columns created:", encoded_df.shape[1])
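
The arithmetic can also be written out directly. This is only a worked sketch: the unique-value counts (4 and 3) are the assumed figures from the calculation above, and the dictionary keys are placeholder names.

```python
# Worked arithmetic; 4 and 3 are assumed unique-value counts, not given by the dataset
n_numerical = 3
unique_counts = {'categorical_1': 4, 'categorical_2': 3}  # placeholder column names

# Each unique category becomes one indicator column
new_columns = sum(unique_counts.values())      # 4 + 3

# The original categorical columns are replaced; numerical columns are kept
total_columns = n_numerical + new_columns      # 3 + 7

print("New columns:", new_columns)
print("Total columns after encoding:", total_columns)
```

The row count is unaffected by encoding, so the dataset would still have 1000 rows.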


Q6: Encoding Technique for Animal Dataset
Choice:
For a dataset containing information about different types of animals (species, habitat, and diet), one-hot encoding is a suitable technique.
Justification:
Categorical Nature: Species, habitat, and diet are categorical variables with no inherent ordinal relationship.
Interpretability: One-hot encoding ensures that the model interprets each category independently.

data = {
    'Species': ['Dog', 'Cat', 'Bird', 'Dog', 'Bird'],
    'Habitat': ['Land', 'Land', 'Air', 'Land', 'Air'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Carnivore', 'Herbivore']
}
df = pd.DataFrame(data)

# Apply one-hot encoding
encoded_df = pd.get_dummies(df, drop_first=False)
print(encoded_df)
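
A quick sanity check on the result, sketched below with the same animal data, is that every row has exactly one active indicator per original feature. The `dtype=int` argument is used here only so the indicators print as 0/1.

```python
import pandas as pd

df = pd.DataFrame({
    'Species': ['Dog', 'Cat', 'Bird', 'Dog', 'Bird'],
    'Habitat': ['Land', 'Land', 'Air', 'Land', 'Air'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Carnivore', 'Herbivore'],
})

encoded = pd.get_dummies(df, dtype=int)

# For every original feature, exactly one of its indicator columns is 1 in each row
for feature in df.columns:
    indicator_cols = [c for c in encoded.columns if c.startswith(feature + '_')]
    row_sums = encoded[indicator_cols].sum(axis=1)
    print(feature, indicator_cols, row_sums.tolist())
```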


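Q7: Encoding Technique for Predicting Customer Churn
Steps:
Identify the categorical columns (gender, contract type).
Apply one-hot encoding to these columns.
Keep the numerical columns (age, monthly charges, tenure) as they are and combine them with the encoded columns.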

data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'Age': [25, 30, 45, 35, 40],
    'Contract Type': ['Month-to-Month', 'One Year', 'Month-to-Month', 'Two Year', 'One Year'],
    'Monthly Charges': [70, 80, 90, 60, 75],
    'Tenure': [1, 12, 24, 3, 6]
}
df = pd.DataFrame(data)

# Apply one-hot encoding to categorical columns
encoded_df = pd.get_dummies(df[['Gender', 'Contract Type']], drop_first=False)

# Combine with numerical columns
final_df = pd.concat([df[['Age', 'Monthly Charges', 'Tenure']], encoded_df], axis=1)
print(final_df)