## **Q1:-**  
### **What is data encoding? How is it useful in data science?**

### **Ans:-**

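### **Data encoding is the process of converting data, such as a sequence of characters or symbols, into a specified format so that it can be stored, transmitted, or processed. Decoding is the reverse process: recovering the original information from the encoded format. In data science, encoding usually means converting categorical (text) values into numbers, because most machine learning algorithms accept only numerical input.**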

## **Q2:-**  
### **What is nominal encoding? Provide an example of how you would use it in a real-world scenario.**

### **Ans:-**

### **Nominal encoding is used for a feature whose values are just names, with no order or rank among them. Examples: the city a person lives in, gender, and marital status. None of these has an inherent order, rank, or sequence, so the encoding must not imply one.**
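
As a minimal sketch of the idea, the snippet below turns a nominal `city` column into integer codes with `pd.factorize`. The DataFrame `people` and its values are hypothetical and exist only for illustration; the integers are assigned in order of first appearance and carry no ranking.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
people = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Delhi"]})

# factorize returns one integer code per row plus the list of unique values.
# Codes follow order of first appearance and do not express any ranking.
codes, uniques = pd.factorize(people["city"])
people["city_code"] = codes

print(people)
print(list(uniques))
```

Decoding is the reverse lookup: `uniques[codes]` gives back the original city names.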

## **Q3:-** 
### **In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.**

### **Ans:-**

### **Nominal encoding is preferred over one-hot encoding in situations where you have categorical data with a large number of unique categories or levels. One-hot encoding can result in a high-dimensional and sparse feature space when you have many categories, which can lead to several issues, including increased computational complexity and the curse of dimensionality. In such cases, nominal encoding can be a more practical and efficient choice.**

#### **Example: Movie Genre Classification:-**

##### Suppose you are building a machine learning model to classify movies into different genres based on various features, including genre information. The genre attribute contains categories such as Action, Comedy, Drama, Horror, Romance, Science Fiction, and many more.

##### One-Hot Encoding Approach:
###### If you were to use one-hot encoding for the genre attribute, you would create a binary column for each genre category. In this example, with around 20 or more genres, you would end up with 20 or more binary columns. Even if a movie belongs to multiple genres, only a few of these columns would be set to 1 and the rest would be 0, making the dataset very sparse.

##### The issues with one-hot encoding in this scenario are:

##### High Dimensionality: The number of columns grows with the number of unique genres, making the dataset high-dimensional.

##### Sparsity: Most movies belong to only a few genres, leading to sparse data with many zeros.

##### Increased Complexity: The increased number of features can slow down training and require more computational resources.

##### Nominal Encoding Approach:
###### With nominal encoding, you represent the genre attribute as a single column of integer codes. The integers are arbitrary labels and do not imply any order or hierarchy among the categories. For example:

##### Action: 1
##### Comedy: 2
##### Drama: 3
##### Horror: 4
##### Romance: 5
##### Science Fiction: 6
##### By using nominal encoding, you reduce the dimensionality of the feature space to just one column, making it much more manageable and computationally efficient.

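##### In this movie genre classification scenario, nominal encoding is preferred because it avoids the issues associated with one-hot encoding, such as high dimensionality and sparsity, while still giving each category a distinct representation. Integer codes work best with models that do not read them as magnitudes, such as tree-based models; a linear model would treat Romance (5) as "greater than" Action (1).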
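
The difference in width between the two approaches can be sketched as follows. The `movies` DataFrame is hypothetical, each movie has exactly one genre for simplicity, and pandas assigns codes starting at 0 in alphabetical order rather than the 1 to 6 numbering used in the list above.

```python
import pandas as pd

# Hypothetical single-genre movie data, for illustration only.
movies = pd.DataFrame({
    "genre": ["Action", "Comedy", "Drama", "Horror",
              "Romance", "Science Fiction", "Comedy", "Action"]
})

# One-hot: one indicator column per distinct genre.
one_hot = pd.get_dummies(movies["genre"])

# Integer codes: a single column, categories numbered alphabetically from 0.
genre_codes = movies["genre"].astype("category").cat.codes

print(one_hot.shape)                             # rows x number of distinct genres
print(genre_codes.to_frame("genre_code").shape)  # rows x 1
```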

## **Q4:-** 
### **Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.**

### **Ans:-**

### **When you have categorical data with only 5 unique values, you can use one-hot encoding to transform this data into a format suitable for machine learning algorithms. One-hot encoding is a suitable choice in this scenario for the following reasons:**

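#### **1. Low Dimensionality:** with 5 unique values, one-hot encoding adds only 5 binary columns.
#### **2. Preservation of Information:** every category keeps its own column, so no category is merged or lost.
#### **3. Compatibility with Machine Learning Algorithms:** the output is purely numerical and introduces no artificial order.
#### **4. Interpretability:** each new column maps directly to one category.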

In [72]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
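
A short sketch of how the encoder could be applied to a feature with 5 unique values. The `colors` array is a hypothetical example, and `handle_unknown="ignore"` is an optional choice made here so that unseen categories encode as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature with 5 unique values; OneHotEncoder expects 2-D input.
colors = np.array([["red"], ["green"], ["blue"], ["yellow"], ["black"], ["red"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # the 5 learned categories, sorted
print(encoded.shape)           # one column per category
```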

## **Q5:-** 
### **In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.**

### **Ans:-**

### **The question does not state how many unique categories each categorical column has, so the count has to be expressed in terms of those numbers. If nominal encoding is implemented as one indicator column per category (as in one-hot encoding), each categorical column produces one new column for each of its unique categories. (If each category is instead replaced by a single integer code, as in the Q3 example, the two columns are transformed in place and no new columns are created.)**

### **Let's assume the following:**

### **1. The first categorical column has 'n' unique categories.**
### **2. The second categorical column has 'm' unique categories.**
#### The two categorical columns then produce 'n' and 'm' new columns respectively, for a total of n + m new columns.

### As an illustration, suppose that:

### **1. The first categorical column has 5 unique categories.**
### **2. The second categorical column has 5 unique categories.**
#### The first categorical column then yields 5 new columns, and the second categorical column also yields 5 new columns.

#### Therefore, in total, you would create 5 + 5 = 10 new columns. After the two original categorical columns are dropped, the dataset has 3 + 10 = 13 columns, and the number of rows stays at 1000.
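
The calculation can be checked on a small stand-in dataset. Everything here is hypothetical: the column names, the 10-row size, and the assumption of 5 unique values per categorical column.

```python
import pandas as pd

# Hypothetical stand-in: 2 categorical columns (5 unique values each, assumed)
# and 3 numerical columns.
df_cat = pd.DataFrame({
    "cat_a": ["a", "b", "c", "d", "e"] * 2,
    "cat_b": ["v", "w", "x", "y", "z"] * 2,
    "num_1": range(10),
    "num_2": range(10),
    "num_3": range(10),
})

# New columns = n + m, the sum of unique categories across both columns.
n_new = df_cat[["cat_a", "cat_b"]].nunique().sum()

# get_dummies drops the original categorical columns and adds the indicators.
encoded = pd.get_dummies(df_cat, columns=["cat_a", "cat_b"])

print(n_new)          # number of indicator columns created
print(encoded.shape)  # rows unchanged; columns = 3 numerical + n_new
```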

## **Q6:-** 
### **You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.**

### **Ans:-**

### **1.One-Hot Encoding:**
#### **Justification: One-hot encoding is appropriate when dealing with nominal categorical variables (categories with no intrinsic order). It preserves the categorical information, avoids introducing artificial ordinality, and is compatible with most machine learning algorithms. Species, habitat, and diet have no natural order, so one-hot encoding is the default choice for this dataset.**

### **2.Label Encoding:**
#### **Justification: Label encoding is suitable when there's an inherent order among categories. It allows the model to capture the ordinality but assumes a linear relationship between categories, which may not always be accurate.**

### **3.Target Encoding (Mean Encoding):**
#### **Justification: Target encoding is suitable when dealing with high-cardinality categorical variables and when there is a meaningful relationship between the categorical variable and the target variable. It could apply to species if the dataset contains many distinct species.**
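
Target (mean) encoding can be sketched with a plain `groupby`. The `animals` DataFrame and its binary `target` column are hypothetical, since the question does not name a target variable. In practice the per-category means should be computed on the training split only, to avoid leaking the target into the features.

```python
import pandas as pd

# Hypothetical data: a categorical feature and an assumed binary target.
animals = pd.DataFrame({
    "habitat": ["forest", "ocean", "forest", "desert", "ocean", "forest"],
    "target":  [1, 0, 1, 0, 1, 0],
})

# Mean of the target within each category.
habitat_means = animals.groupby("habitat")["target"].mean()

# Replace each category with its mean target value.
animals["habitat_target_enc"] = animals["habitat"].map(habitat_means)

print(animals)
```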

## **Q7:-**
### **You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.**

### **Ans:-**

In [73]:
import pandas as pd
import random
num_rows = 1000
genders = ['male', 'female']
gender_data = [random.choice(genders) for _ in range(num_rows)]
age_data = [random.randint(18, 70) for _ in range(num_rows)]
contract_types = ['monthly', 'one-year', 'two-year']
contract_data = [random.choice(contract_types) for _ in range(num_rows)]
monthly_charges_data = [round(random.uniform(30, 100), 2) for _ in range(num_rows)]
tenure_data = [random.randint(1, 60) for _ in range(num_rows)]
df = pd.DataFrame({
    'gender': gender_data,
    'age': age_data,
    'contract_type': contract_data,
    'monthly_charges': monthly_charges_data,
    'tenure': tenure_data
})
print(df.head())

   gender  age contract_type  monthly_charges  tenure
0  female   40      two-year            52.92      56
1    male   42      two-year            45.93      31
2  female   48      one-year            76.45      31
3    male   22       monthly            47.73      36
4    male   35      two-year            40.52      33


### **Step 1: Data Preprocessing**
### **Step 2: Identifying Categorical Features**
### **Step 3: Choose Encoding Techniques**
#### **1.Label Encoding for Binary Categorical Variables (Gender):**

In [74]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['gender_encoded'] = label_encoder.fit_transform(df['gender'])

#### **2.One-Hot Encoding for Multi-Class Categorical Variables (Contract Type):**

In [75]:
df = pd.get_dummies(df, columns=['contract_type'], prefix=['contract'])

### **Step 4: Drop the Original Categorical Columns**

In [76]:
df = df.drop(columns=["gender"])  # assign the result; drop() does not modify df in place
df

     age  monthly_charges  tenure  gender_encoded  contract_monthly  contract_one-year  contract_two-year
0     40            52.92      56               0             False              False               True
1     42            45.93      31               1             False              False               True
2     48            76.45      31               0             False               True              False
3     22            47.73      36               1              True              False              False
4     35            40.52      33               1             False              False               True
..   ...              ...     ...             ...               ...                ...                ...
995   27            87.80      32               1             False              False               True
996   65            89.01      22               1              True              False              False
997   64            75.28      56               1             False               True              False
998   58            77.07      59               1             False               True              False


### **Step 5: Model Training**

In [77]:
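# Features: the numerical and encoded columns, excluding the target.
x=df[["age","tenure","gender_encoded","contract_monthly","contract_one-year","contract_two-year"]]
# The synthetic data has no churn column, so monthly_charges stands in as a placeholder target.
y=df[["monthly_charges"]]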

In [78]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20,random_state=0)
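
The cells above stop at the train/test split. As a rough sketch of what the training step might look like for churn, the snippet below builds a small stand-in dataset with a hypothetical `churn` label (random, so the accuracy is meaningless) and fits a `LogisticRegression`. The choice of model, the `demo` DataFrame, and the `churn` column are all assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in for the encoded churn dataset.
demo = pd.DataFrame({
    "age": rng.integers(18, 71, n),
    "tenure": rng.integers(1, 61, n),
    "gender_encoded": rng.integers(0, 2, n),
    "contract": rng.choice(["monthly", "one-year", "two-year"], n),
})
demo = pd.get_dummies(demo, columns=["contract"], prefix="contract")
demo["churn"] = rng.integers(0, 2, n)  # assumed random label, illustration only

X = demo.drop(columns=["churn"])
y = demo["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Fit a simple classifier on the encoded features and score it on held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```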