
In machine learning, categorical encoding is the process of converting categorical data into numerical values so that models can process it. Python offers several encoding methods, both with and without libraries such as `scikit-learn`. Below are some common techniques:

### 1. **Label Encoding**
   **Without `sklearn`:**
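   - Each category is assigned a unique integer value.
   - The integers are arbitrary identifiers. If the categories have a meaningful order, define the mapping explicitly (see Ordinal Encoding) rather than relying on the assigned values.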

   ```python
   data = ['cat', 'dog', 'fish', 'dog']
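   # Sort the unique values so the mapping is the same on every run (set order is not guaranteed)
   encoding_dict = {value: idx for idx, value in enumerate(sorted(set(data)))}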
   encoded_data = [encoding_dict[item] for item in data]
   print(encoded_data)
   ```

   **With `sklearn`:**
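   - `LabelEncoder` assigns an integer to each category, in sorted order of the category values. The scikit-learn documentation describes it as a transformer for target labels (`y`) rather than input features.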

   ```python
   from sklearn.preprocessing import LabelEncoder
   data = ['cat', 'dog', 'fish', 'dog']
   le = LabelEncoder()
   encoded_data = le.fit_transform(data)
   print(encoded_data)
   ```
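
Integer codes are only useful if they can be mapped back to the original categories. The sketch below uses only the standard library: it builds the mapping from sorted categories (the same alphabetical order `LabelEncoder` uses) and inverts it for decoding. The names `encode_labels` and `decode_labels` are hypothetical helpers introduced for illustration.

```python
def encode_labels(values):
    """Map each distinct value to an integer, using sorted order for determinism."""
    mapping = {value: idx for idx, value in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def decode_labels(codes, mapping):
    """Invert the mapping to recover the original categories."""
    inverse = {idx: value for value, idx in mapping.items()}
    return [inverse[c] for c in codes]


data = ['cat', 'dog', 'fish', 'dog']
codes, mapping = encode_labels(data)
print(codes)                          # [0, 1, 2, 1]
print(decode_labels(codes, mapping))  # ['cat', 'dog', 'fish', 'dog']
```

Keeping the mapping alongside the encoded data is what makes the round trip possible. With `LabelEncoder`, the equivalent step is `inverse_transform`.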

### 2. **One-Hot Encoding**
   **Without `sklearn`:**
   - Creates a binary column for each category. For each observation, only the corresponding category column is set to 1, while others are set to 0.

   ```python
   import pandas as pd
   data = ['cat', 'dog', 'fish', 'dog']
   df = pd.DataFrame(data, columns=['Animal'])
   one_hot_encoded = pd.get_dummies(df['Animal'])
   print(one_hot_encoded)
   ```

   **With `sklearn`:**
   - `OneHotEncoder` can be used to perform one-hot encoding.

   ```python
   from sklearn.preprocessing import OneHotEncoder
   import numpy as np
   data = np.array(['cat', 'dog', 'fish', 'dog']).reshape(-1, 1)
   encoder = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
   one_hot_encoded = encoder.fit_transform(data)
   print(one_hot_encoded)
   ```
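
One practical question with one-hot encoding is what to do with a category that appears at prediction time but was absent from the training data. The following is a minimal standard-library sketch, assuming that unseen categories should map to an all-zero row (similar in spirit to `OneHotEncoder(handle_unknown='ignore')`). The helper names `one_hot_fit` and `one_hot_transform` are hypothetical.

```python
def one_hot_fit(values):
    """Learn the category list (sorted for a stable column order)."""
    return sorted(set(values))


def one_hot_transform(values, categories):
    """Return one row per value; unseen categories become an all-zero row."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return rows


train = ['cat', 'dog', 'fish', 'dog']
categories = one_hot_fit(train)
print(categories)                                      # ['cat', 'dog', 'fish']
print(one_hot_transform(['dog', 'bird'], categories))  # [[0, 1, 0], [0, 0, 0]]
```

Separating the fit step from the transform step keeps the column layout fixed, so training and prediction data always have the same width.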

### 3. **Ordinal Encoding**
   **Without `sklearn`:**
   - Categories are ordered and encoded as integers based on their order. This method is useful when the data has an ordinal relationship.

   ```python
   data = ['low', 'medium', 'high', 'medium']
   order = {'low': 0, 'medium': 1, 'high': 2}
   encoded_data = [order[item] for item in data]
   print(encoded_data)
   ```

   **With `sklearn`:**
   - `OrdinalEncoder` is used to encode categories based on a predefined order.

   ```python
   from sklearn.preprocessing import OrdinalEncoder
   data = [['low'], ['medium'], ['high'], ['medium']]
   encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
   encoded_data = encoder.fit_transform(data)
   print(encoded_data)
   ```
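
When the data already lives in `pandas`, an ordered `Categorical` is another way to attach an explicit order to the categories. The sketch below assumes the same three levels as above; the variable names are illustrative only.

```python
import pandas as pd

data = ['low', 'medium', 'high', 'medium']
levels = ['low', 'medium', 'high']  # explicit order, lowest first

cat = pd.Categorical(data, categories=levels, ordered=True)
codes = cat.codes.tolist()
print(codes)  # [0, 1, 2, 1]

# Values outside the declared categories receive the code -1
unknown = pd.Categorical(['low', 'extreme'], categories=levels, ordered=True)
print(unknown.codes.tolist())  # [0, -1]
```

The `-1` code is worth checking for before training, since it signals a value that was not part of the declared order.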

### 4. **Binary Encoding**
   **Without `sklearn`:**
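   - Each category is first mapped to an integer, and that integer is then written in binary digits. With one column per digit, `n` categories need roughly `log2(n)` columns instead of the `n` columns one-hot encoding requires, which helps with high-cardinality features.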

   ```python
   import pandas as pd
   data = ['cat', 'dog', 'fish', 'dog']
   df = pd.DataFrame(data, columns=['Animal'])
   df['Binary'] = df['Animal'].astype('category').cat.codes
   df['Binary'] = df['Binary'].apply(lambda x: bin(x)[2:].zfill(3))  # Adjust length based on cardinality
   print(df)
   ```

   **With `category_encoders`:**
   - `BinaryEncoder` is provided by the `category_encoders` package, not by `sklearn` itself.

   ```python
   import pandas as pd
   import category_encoders as ce
   data = ['cat', 'dog', 'fish', 'dog']
   encoder = ce.BinaryEncoder(cols=['Animal'])
   df = pd.DataFrame(data, columns=['Animal'])
   encoded_data = encoder.fit_transform(df)
   print(encoded_data)
   ```
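
To make the column layout explicit, the following standard-library sketch spreads each binary digit into its own position. It assumes integer codes start at 1 and follow order of first appearance; that choice is an assumption of this sketch, and `binary_encode` is a hypothetical helper name.

```python
def binary_encode(values):
    """Return one list of bits per value, plus the category-to-code mapping."""
    codes = {}
    for v in values:
        # Codes start at 1 and follow order of first appearance
        # (an assumption of this sketch, not a requirement of the technique).
        codes.setdefault(v, len(codes) + 1)
    width = len(codes).bit_length()  # bits needed for the largest code
    rows = [[int(bit) for bit in format(codes[v], f'0{width}b')] for v in values]
    return rows, codes


data = ['cat', 'dog', 'fish', 'dog']
rows, codes = binary_encode(data)
print(codes)  # {'cat': 1, 'dog': 2, 'fish': 3}
print(rows)   # [[0, 1], [1, 0], [1, 1], [1, 0]]
```

Three categories fit in two bit columns here, whereas one-hot encoding would need three.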

### 5. **Frequency Encoding**
   **Without `sklearn`:**
   - Each category is replaced by its frequency (how often it occurs in the dataset).

   ```python
   from collections import Counter

   data = ['cat', 'dog', 'fish', 'dog']
   counts = Counter(data)  # count every category in a single pass
   encoded_data = [counts[x] for x in data]
   print(encoded_data)
   ```

   **With `pandas`:**
   - `sklearn` has no dedicated frequency encoder, but the same result takes one line with `pandas`.

   ```python
   import pandas as pd
   data = ['cat', 'dog', 'fish', 'dog']
   df = pd.DataFrame(data, columns=['Animal'])
   freq_encoded = df['Animal'].map(df['Animal'].value_counts())
   print(freq_encoded)
   ```
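
Raw counts depend on the size of the dataset, so a common variant replaces each category with its share of all observations instead. A minimal sketch using only the standard library:

```python
from collections import Counter

data = ['cat', 'dog', 'fish', 'dog']
counts = Counter(data)
total = len(data)

# Replace each category with its share of all observations
encoded = [counts[x] / total for x in data]
print(encoded)  # [0.25, 0.5, 0.25, 0.5]
```

Note that `cat` and `fish` receive the same value because they occur equally often. Frequency encoding cannot distinguish categories with identical frequencies.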

### 6. **Target Encoding (Mean Encoding)**
   **Without `sklearn`:**
   - Each category is replaced with the mean of the target variable for that category. Because the encoding uses the target, the mapping should be computed on training data only, to avoid leaking target information into evaluation data.

   ```python
   data = ['cat', 'dog', 'fish', 'dog']
   target = [1, 0, 1, 0]
   mapping = {}
   for category in set(data):
       # Collect the target values that belong to this category, then average them
       values = [t for d, t in zip(data, target) if d == category]
       mapping[category] = sum(values) / len(values)
   encoded_data = [mapping[item] for item in data]
   print(encoded_data)
   ```

   **With `category_encoders`:**
   - Recent versions of scikit-learn (1.3 and later) include `sklearn.preprocessing.TargetEncoder`. The example below uses the `category_encoders` implementation instead.

   ```python
   import pandas as pd
   import category_encoders as ce
   data = ['cat', 'dog', 'fish', 'dog']
   target = [1, 0, 1, 0]
   df = pd.DataFrame({'Animal': data, 'Target': target})
   encoder = ce.TargetEncoder(cols=['Animal'])
   df_encoded = encoder.fit_transform(df['Animal'], df['Target'])
   print(df_encoded)
   ```
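
With only a few rows per category, the raw category mean can be noisy. One common remedy is to blend each category mean with the global mean. The sketch below is a simple additive-smoothing variant under stated assumptions; it is not claimed to match what any particular library does internally. The function name `smoothed_target_encode` and the weight `m` are hypothetical.

```python
from collections import defaultdict


def smoothed_target_encode(categories, targets, m=1.0):
    """Blend each category's mean target with the global mean.

    m is a hypothetical smoothing weight: larger values pull small
    categories more strongly toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    mapping = {
        c: (sums[c] + m * global_mean) / (counts[c] + m)
        for c in counts
    }
    return [mapping[c] for c in categories], mapping


data = ['cat', 'dog', 'fish', 'dog']
target = [1, 0, 1, 0]
encoded, mapping = smoothed_target_encode(data, target, m=1.0)
print(mapping)
```

With `m=1.0`, `cat` moves from a raw mean of 1.0 to 0.75, halfway toward the global mean of 0.5, because it has only one observation. `dog`, with two observations, moves from 0.0 to about 0.17.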

### Summary:
- **Label Encoding** and **Ordinal Encoding** both map categories to integers. Label encoding assigns the integers without regard to meaning, while ordinal encoding follows an explicit category order.
- **One-Hot Encoding** is ideal for nominal data, ensuring no ordinal relationship is assumed.
- **Binary Encoding** helps handle high cardinality features effectively.
- **Frequency Encoding** and **Target Encoding** replace each category with a statistic (its frequency or its mean target value), which keeps the feature in a single column even when there are many categories.

You can choose the encoding based on the problem you are solving and the nature of the categorical variable.

An **ordinal relationship** exists when the categories of a variable have a meaningful order or ranking, but the differences between categories are not necessarily uniform or measurable. One category can be considered "greater" or "lesser" than another, yet nothing guarantees that the intervals between them are equal.

### Key Characteristics of Ordinal Data:
1. **Orderable**: The categories can be logically ordered or ranked.
2. **Unequal intervals**: The difference between adjacent categories is not necessarily consistent or measurable. For example, the difference between "low" and "medium" might not be the same as the difference between "medium" and "high."

### Example of Ordinal Data:
Consider a **rating scale** where users rate a product with the following options:
- "Poor"
- "Fair"
- "Good"
- "Excellent"

In this case:
- "Poor" is considered the lowest, while "Excellent" is the highest.
- These categories have a natural order: "Poor" < "Fair" < "Good" < "Excellent".
- However, the intervals are not necessarily equal: the step from "Poor" to "Fair" need not be the same size as the step from "Good" to "Excellent".

### Other Examples of Ordinal Data:
1. **Education Level**: "High School", "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Doctorate"
   - The categories can be ranked, but the difference in education level doesn't follow a consistent, measurable interval.
   
2. **Satisfaction Rating**: "Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"
   - The categories have an order based on satisfaction, but the difference between each rating is not necessarily the same.

3. **Socioeconomic Status**: "Low", "Middle", "High"
   - There is an inherent order in terms of wealth or status, but the distance between them is not uniform.

### How Ordinal Encoding Works:
For ordinal data, **Ordinal Encoding** assigns numerical values that reflect the order of the categories. **Label Encoding** also produces integers, but they follow the meaningful order only if the mapping is specified explicitly. For example, the satisfaction ratings can be encoded as:
- "Very Unsatisfied" → 1
- "Unsatisfied" → 2
- "Neutral" → 3
- "Satisfied" → 4
- "Very Satisfied" → 5

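These numeric values preserve the order. The underlying categories are not necessarily equally spaced, although any model that treats the codes as ordinary numbers (a linear model, for example) will behave as if they were.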
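
The satisfaction mapping above can be sketched in a few lines of standard-library Python. The `responses` list is made-up sample data for illustration.

```python
satisfaction_order = ['Very Unsatisfied', 'Unsatisfied', 'Neutral',
                      'Satisfied', 'Very Satisfied']
# Codes start at 1 to match the mapping listed above
rank = {label: i for i, label in enumerate(satisfaction_order, start=1)}

responses = ['Neutral', 'Very Satisfied', 'Unsatisfied', 'Satisfied']  # sample data
encoded = [rank[r] for r in responses]
print(encoded)  # [3, 5, 2, 4]

# Order-based operations remain meaningful on the codes
ordered = sorted(responses, key=rank.get)
print(ordered)  # ['Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
```

Sorting and comparing are safe on these codes. Arithmetic such as averaging is only meaningful if equal spacing between ratings is an acceptable assumption.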

### When to Use Ordinal Encoding:
Ordinal encoding is most appropriate when the categorical variable has an inherent order, and you want to preserve that order while converting the data into a numerical format suitable for machine learning models.

However, it's important to note that ordinal encoding should not be applied to nominal data (data without an inherent order), as it might impose a false ranking or hierarchy.
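
The false ranking is easy to see numerically. The sketch below compares the gaps between integer codes with the distances between one-hot vectors for a nominal variable. The animal list is illustrative, and `squared_distance` is a hypothetical helper.

```python
animals = ['bird', 'cat', 'cow', 'dog', 'fish', 'sheep']  # nominal: no real order

# Integer codes: the gap between codes depends only on alphabetical position
codes = {name: i for i, name in enumerate(animals)}
label_gap_near = abs(codes['bird'] - codes['cat'])
label_gap_far = abs(codes['bird'] - codes['sheep'])

# One-hot vectors: every pair of distinct categories is equally far apart
one_hot = {name: [int(i == j) for j in range(len(animals))]
           for i, name in enumerate(animals)}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


hot_gap_near = squared_distance(one_hot['bird'], one_hot['cat'])
hot_gap_far = squared_distance(one_hot['bird'], one_hot['sheep'])

print(label_gap_near, label_gap_far)  # 1 5
print(hot_gap_near, hot_gap_far)      # 2 2
```

With integer codes, a distance-based model would treat `bird` as five times closer to `cat` than to `sheep`, purely because of alphabetical order. One-hot encoding avoids that artifact.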

In [176]:
import pandas as pd

data = pd.DataFrame({
    'Animal': ['cat', 'dog', 'fish', 'cow', 'sheep', 'cat', 'dog',
               'sheep', 'cat', 'cow', 'fish', 'dog', 'bird', 'cat',
               'bird', 'fish', 'fish', 'dog', 'bird', 'cow'],
    'Count': [2, 4, 5, 1, 4, 2, 2, 
              2, 3, 1, 1, 4, 2, 3, 
              5, 1, 4, 2, 2, 3],
    'Weight': [4.5, 6.2, 5.0, 2.1, 6.8, 4.8, 4.5, 
               6.2, 5.0, 2.1, 6.8, 4.8, 4.5, 6.2, 
               5.0, 2.1, 6.8, 4.8, 6.8, 4.8]
})

data['Animal'] = data['Animal'].astype('category')
categories=data['Animal'].cat.categories

print(data)

   Animal  Count  Weight
0     cat      2     4.5
1     dog      4     6.2
2    fish      5     5.0
3     cow      1     2.1
4   sheep      4     6.8
5     cat      2     4.8
6     dog      2     4.5
7   sheep      2     6.2
8     cat      3     5.0
9     cow      1     2.1
10   fish      1     6.8
11    dog      4     4.8
12   bird      2     4.5
13    cat      3     6.2
14   bird      5     5.0
15   fish      1     2.1
16   fish      4     6.8
17    dog      2     4.8
18   bird      2     6.8
19    cow      3     4.8


In [177]:
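from sklearn.preprocessing import LabelEncoder

lb_encoder = LabelEncoder()
lb_data = data.copy(deep=True)
lb_data['Animal'] = lb_encoder.fit_transform(data['Animal'])
print(lb_data)

# Rebuild the category-to-code mapping. pandas stores the categories in
# sorted order, which matches the order LabelEncoder uses.
keys = {key: i for i, key in enumerate(categories)}
print(keys)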

    Animal  Count  Weight
0        1      2     4.5
1        3      4     6.2
2        4      5     5.0
3        2      1     2.1
4        5      4     6.8
5        1      2     4.8
6        3      2     4.5
7        5      2     6.2
8        1      3     5.0
9        2      1     2.1
10       4      1     6.8
11       3      4     4.8
12       0      2     4.5
13       1      3     6.2
14       0      5     5.0
15       4      1     2.1
16       4      4     6.8
17       3      2     4.8
18       0      2     6.8
19       2      3     4.8
{'bird': 0, 'cat': 1, 'cow': 2, 'dog': 3, 'fish': 4, 'sheep': 5}


In [178]:
from sklearn.preprocessing import OneHotEncoder
hot_encoder=OneHotEncoder(sparse_output=False)
temp=hot_encoder.fit_transform(data[['Animal']])
temp=pd.DataFrame(temp,columns=hot_encoder.categories_[0]).astype(int)
print(temp)


    bird  cat  cow  dog  fish  sheep
0      0    1    0    0     0      0
1      0    0    0    1     0      0
2      0    0    0    0     1      0
3      0    0    1    0     0      0
4      0    0    0    0     0      1
5      0    1    0    0     0      0
6      0    0    0    1     0      0
7      0    0    0    0     0      1
8      0    1    0    0     0      0
9      0    0    1    0     0      0
10     0    0    0    0     1      0
11     0    0    0    1     0      0
12     1    0    0    0     0      0
13     0    1    0    0     0      0
14     1    0    0    0     0      0
15     0    0    0    0     1      0
16     0    0    0    0     1      0
17     0    0    0    1     0      0
18     1    0    0    0     0      0
19     0    0    1    0     0      0


In [179]:
hot_data=data.copy(deep=True)
hot_data=hot_data.drop(columns=['Animal'])
hot_data=temp.join(hot_data)
print(hot_data)

    bird  cat  cow  dog  fish  sheep  Count  Weight
0      0    1    0    0     0      0      2     4.5
1      0    0    0    1     0      0      4     6.2
2      0    0    0    0     1      0      5     5.0
3      0    0    1    0     0      0      1     2.1
4      0    0    0    0     0      1      4     6.8
5      0    1    0    0     0      0      2     4.8
6      0    0    0    1     0      0      2     4.5
7      0    0    0    0     0      1      2     6.2
8      0    1    0    0     0      0      3     5.0
9      0    0    1    0     0      0      1     2.1
10     0    0    0    0     1      0      1     6.8
11     0    0    0    1     0      0      4     4.8
12     1    0    0    0     0      0      2     4.5
13     0    1    0    0     0      0      3     6.2
14     1    0    0    0     0      0      5     5.0
15     0    0    0    0     1      0      1     2.1
16     0    0    0    0     1      0      4     6.8
17     0    0    0    1     0      0      2     4.8
18     1    0    0    0     0      0      2     6.8
19     0    0    1    0     0      0      3     4.8

In [180]:
import category_encoders as ce
b_encoder = ce.BinaryEncoder(cols=['Animal'])
temp = b_encoder.fit_transform(data['Animal'])


binary_data=data.copy(deep=True)
binary_data=binary_data.drop(columns=['Animal'])
binary_data=temp.join(binary_data)
print(binary_data)


    Animal_0  Animal_1  Animal_2  Count  Weight
0          0         0         1      2     4.5
1          0         1         0      4     6.2
2          0         1         1      5     5.0
3          1         0         0      1     2.1
4          1         0         1      4     6.8
5          0         0         1      2     4.8
6          0         1         0      2     4.5
7          1         0         1      2     6.2
8          0         0         1      3     5.0
9          1         0         0      1     2.1
10         0         1         1      1     6.8
11         0         1         0      4     4.8
12         1         1         0      2     4.5
13         0         0         1      3     6.2
14         1         1         0      5     5.0
15         0         1         1      1     2.1
16         0         1         1      4     6.8
17         0         1         0      2     4.8
18         1         1         0      2     6.8
19         1         0         0      3     4.8



In [181]:
# dict_cat=dict(data['Animal'].value_counts().astype(int))
# print(dict_cat)
# freq_encoded = data['Animal'].map(data['Animal'].value_counts())
# print(freq_encoded)

# tmp=pd.get_dummies(data, columns=['Animal'])
# print(tmp)
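
The commented-out cell above points toward frequency encoding on the notebook's dataset. A runnable sketch of that idea follows. It rebuilds only the `Animal` column so that it is self-contained, and it keeps the values as plain strings rather than the `category` dtype (a simplification made for this example).

```python
import pandas as pd

animals = pd.Series(
    ['cat', 'dog', 'fish', 'cow', 'sheep', 'cat', 'dog',
     'sheep', 'cat', 'cow', 'fish', 'dog', 'bird', 'cat',
     'bird', 'fish', 'fish', 'dog', 'bird', 'cow'],
    name='Animal',
)

counts = animals.value_counts()      # occurrences per category
freq_encoded = animals.map(counts)   # replace each value with its count
print(counts.sort_index().to_dict())
# {'bird': 3, 'cat': 4, 'cow': 3, 'dog': 4, 'fish': 4, 'sheep': 2}
print(freq_encoded.head().tolist())  # [4, 4, 4, 3, 2]
```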