<a href="https://www.kaggle.com/code/ayushs9020/one-hot-encoder-from-scratch?scriptVersionId=128511955" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 1 | Encoding Techniques From Scratch 

Categorical encoding is a technique used in machine learning to represent categorical data in a numerical form that can be used as input to machine learning models. Categorical data is non-numeric data, for example the name of a country or the types of products sold in a store.

There are several categorical encoding techniques, such as one-hot encoding, ordinal encoding, and target encoding. One-hot encoding creates a binary vector for each category, with each element of the vector indicating whether that category is present. Ordinal encoding assigns a numerical value to each category based on its rank or order. Target encoding calculates the average target value for each category and replaces the category with that value.
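
To make the last two descriptions a bit more concrete, here is a minimal sketch of ordinal and target encoding with pandas. The column names (`size`, `bought`) and the assumed order of the sizes are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "bought": [0, 1, 1, 1],
})

# Ordinal encoding: map each category to its rank in an assumed order
size_order = {"small": 0, "medium": 1, "large": 2}
toy["size_ordinal"] = toy["size"].map(size_order)

# Target encoding: replace each category with the mean target of its group
toy["size_target"] = toy.groupby("size")["bought"].transform("mean")

print(toy)
```

One-hot encoding, the third technique, is what the rest of this notebook builds step by step.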

Categorical encoding is important because many machine learning algorithms require numerical data as input. Once categorical data is turned into numbers, it is easier to process and analyze, and models can be trained to make predictions from it.

So what is an `encoding technique`?

Let's assume we have a `dataset` like this:

|CGPA|Languages Known|Placement In Ohio|
|---|---|---|
|6.9|Python|1
|4.5|Java|0
|8.8|C++|1
|9.9|Python|1

And now we want to `predict` the `likelihood` of a student being placed, given data like this:

|CGPA|Languages Known|Placement In Ohio|
|---|---|---|
|7.9|Python|?

You might say: just pass all the data into a machine learning model, and it will predict the result.

But, but, but...

There is a problem. Machine learning models only accept numerical data, not categorical or `string` data. So how can we represent the information in the column `Languages Known`?
One way to do this is to `make separate columns for the distinct values`, like this:

|CGPA|Languages Known|Python|Java|C++|Placement In Ohio|
|---|---|---|---|---|---|
|6.9|Python|1|0|0|1
|4.5|Java|0|1|0|0
|8.8|C++|0|0|1|1
|9.9|Python|1|0|0|1

and then we can drop the column `Languages Known`, or in other words replace it with this matrix:

|CGPA|Python|Java|C++|Placement In Ohio|
|---|---|---|---|---|
|6.9|1|0|0|1
|4.5|0|1|0|0
|8.8|0|0|1|1
|9.9|1|0|0|1

and we can do the same for the testing data:

|CGPA|Python|Java|C++|Placement In Ohio|
|---|---|---|---|---|
|7.9|1|0|0|?

Now we can feed this data into any model and expect results.

What we just learned is an encoding technique called `One Hot Encoding`. Here we treat every category as distinct and equally important, with no order between them, and give each one its own 0/1 column.
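
As a quick sketch of the same idea in code, the toy table from above can be encoded with a small loop. The variable names (`students`, `language`) are just picked for this example.

```python
import numpy as np
import pandas as pd

# The toy table from above
students = pd.DataFrame({
    "CGPA": [6.9, 4.5, 8.8, 9.9],
    "Languages Known": ["Python", "Java", "C++", "Python"],
    "Placement In Ohio": [1, 0, 1, 1],
})

# One new 0/1 column per distinct language
for language in students["Languages Known"].unique():
    students[language] = np.where(students["Languages Known"] == language, 1, 0)

# The original text column is no longer needed
students = students.drop("Languages Known", axis=1)
print(students)
```

Every row has exactly one `1` among the three language columns, which is where the name "one hot" comes from.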



Now comes another question: how can we actually do this?

**How** - As usual, there are two ways to implement the transformer:

* **Using specialized libraries** - `scikit-learn`, one of the most popular all-round libraries, ships a ready-made `OneHotEncoder` transformer that you can apply to any set of columns.

* **Making your own encoders** - You can always build things yourself. This gives you a `lot of understanding` of the concept and of what the library is doing behind the scenes.

# 2 | Using Specialized Libraries

```
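import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# data : the DataFrame to encode , cat : list of its categorical column names
# sparse_output needs scikit-learn >= 1.2 (older versions call it sparse)
encoder = OneHotEncoder(drop = "first" , sparse_output = False , 
                        categories = "auto")
encoded = pd.DataFrame(encoder.fit_transform(data[cat]) , 
                       columns = encoder.get_feature_names_out(cat) , 
                       index = data.index)

data = pd.concat([encoded , data.drop(cat , axis = 1)] , 
                 axis = 1 , join = "inner")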
```


# 3 | Making Your Own

Let's assume we have this dataset:

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv
/kaggle/input/house-prices-advanced-regression-techniques/data_description.txt
/kaggle/input/house-prices-advanced-regression-techniques/train.csv
/kaggle/input/house-prices-advanced-regression-techniques/test.csv


In [2]:
import numpy as np
import pandas as pd

First of all, let's see what our data really looks like.

In [3]:
data = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")

In [4]:
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


`MSZoning` seems to be a good column for testing our code, so we will focus on it first.

In [5]:
sample_data = data["MSZoning"]

In [6]:
sample_data

0       RL
1       RL
2       RL
3       RL
4       RL
        ..
1455    RL
1456    RL
1457    RL
1458    RL
1459    RL
Name: MSZoning, Length: 1460, dtype: object

We can get the unique values that appear in the column with the `pd.Series.unique()` function

In [7]:
sample_data.unique()

array(['RL', 'RM', 'C (all)', 'FV', 'RH'], dtype=object)

In [8]:
type(sample_data.unique())

numpy.ndarray

This is a `numpy array` of objects as you can see

But there is a problem with this array, or rather with the `unique` function: the result is not sorted, and it gives us no way to sort the values by their number of occurrences in the dataset. So we will instead use the `pd.Series.value_counts()` function, which gives us both the values and their number of occurrences in the dataset.

In [9]:
sample_data.value_counts()

RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64

In [10]:
type(sample_data.value_counts())

pandas.core.series.Series

As you can see, this is a `pandas.core.series.Series`. You can access only the categories through its `index`, as `pd.Series.value_counts().index`, and the occurrences by converting the values to a list, like this

In [11]:
ocurrences = sample_data.value_counts().tolist()

In [12]:
ocurrences

[1151, 218, 65, 16, 10]

In [13]:
categories = sample_data.value_counts().index.tolist()

In [14]:
categories

['RL', 'RM', 'FV', 'RH', 'C (all)']

We could also have used the `to_dict()` function to make a dictionary with `key = categories` and `value = occurrences` and then done the further processing on that. I have not considered that method here, but will update it in newer versions of this notebook.
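
For reference, a minimal sketch of what the `to_dict()` route could look like. The small Series here is a made-up stand-in for `sample_data`, so the counts differ from the real ones.

```python
import pandas as pd

# Made-up stand-in for the MSZoning column
zoning = pd.Series(["RL", "RL", "RM", "RL", "FV", "RM"], name="MSZoning")

# key = category, value = number of occurrences (most frequent first)
counts = zoning.value_counts().to_dict()

categories = list(counts.keys())
occurrences = list(counts.values())
print(counts)
```

The dictionary keeps the order of `value_counts()`, so both lists still run from the most frequent category to the least frequent one.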

So we will use this list to make new columns in our dataset. We will simply run a for loop over every value and use the `np.where` function to add a new column of binary digits specifying whether the particular category occurs at each position or not. But let's first try this for one category, and then we will apply the for loop.

In [15]:
sample_data = pd.DataFrame(sample_data)

In [16]:
sample_data

Unnamed: 0,MSZoning
0,RL
1,RL
2,RL
3,RL
4,RL
...,...
1455,RL
1456,RL
1457,RL
1458,RL


As mentioned earlier, we will use `np.where()` to make our work easier

In [17]:
sample_data["RL"] = np.where(sample_data == "RL" , 1 , 0)

Now let's see our sample dataset

In [18]:
sample_data

Unnamed: 0,MSZoning,RL
0,RL,1
1,RL,1
2,RL,1
3,RL,1
4,RL,1
...,...,...
1455,RL,1
1456,RL,1
1457,RL,1
1458,RL,1


As we can see, most of our work is done. Now we just need to apply the loop

In [19]:
sample_data["MSZoning"].value_counts().index

Index(['RL', 'RM', 'FV', 'RH', 'C (all)'], dtype='object')

In [20]:
sample_data

Unnamed: 0,MSZoning,RL
0,RL,1
1,RL,1
2,RL,1
3,RL,1
4,RL,1
...,...,...
1455,RL,1
1456,RL,1
1457,RL,1
1458,RL,1


In [21]:
for i in sample_data.value_counts().index:
    sample_data["MSZoning" + "_" + i[0]] = np.where(sample_data["MSZoning"] == i[0] , 1 , 0)
    
sample_data.drop("MSZoning" + "_" + str(sample_data["MSZoning"].value_counts().index[0]) , axis = 1 , inplace = True)

In [22]:
sample_data

Unnamed: 0,MSZoning,RL,MSZoning_RM,MSZoning_FV,MSZoning_RH,MSZoning_C (all)
0,RL,1,0,0,0,0
1,RL,1,0,0,0,0
2,RL,1,0,0,0,0
3,RL,1,0,0,0,0
4,RL,1,0,0,0,0
...,...,...,...,...,...,...
1455,RL,1,0,0,0,0
1456,RL,1,0,0,0,0
1457,RL,1,0,0,0,0
1458,RL,1,0,0,0,0


Now we have created our one hot encoder from scratch. We will add some functionalities

* Iterating over a list of columns
* min_frequency
* max_categories

For **iterating over a list of columns**, we just loop over the given columns. For better usability we will put this into a function

In [23]:
def sample_func(dataframe , columns):
    for i in columns:
                
        # dataframe[i] is a Series , so its value_counts().index holds plain labels
        for j in dataframe[i].value_counts().index:
                
            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 1 , 0)
                
        dataframe.drop(i , axis = 1 , inplace = True)

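This is vulnerable to one risk: the user passing a single column. When I called `value_counts()` on the whole `sample_data` frame above, the categories came back as tuples, while a single column gives plain values. This comes from pandas rather than numpy: `DataFrame.value_counts()` returns a `MultiIndex` whose entries are tuples, whereas `Series.value_counts()`, which is what `dataframe[i]` uses, returns the labels themselves. So inside the function no tuple handling is needed. What can still break is a single column name passed as a plain string instead of a list, because the loop would then iterate over its characters, so we add an `if` condition for that case. **ANY OTHER LEADS ARE HIGHLY APPRECIATED, COMMENT IF YOU KNOW A BETTER FIX**

In [24]:
def sample_func(dataframe , columns):
    
    # a single column name passed as a plain string is wrapped into a list
    if isinstance(columns , str):
        
        columns = [columns]

    for i in columns:
                
        # dataframe[i] is a Series , so the index holds plain labels , not tuples
        for j in dataframe[i].value_counts().index:
                
            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 1 , 0)
                
        dataframe.drop(i , axis = 1 , inplace = True)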
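
One possible source of the tuples is worth checking: in pandas, calling `value_counts()` on a whole DataFrame gives an index of tuples, while calling it on a single column (a Series) gives plain labels. A small sketch with a made-up frame (the name `frame` is only for this example):

```python
import pandas as pd

frame = pd.DataFrame({"MSZoning": ["RL", "RL", "RM"]})

# DataFrame.value_counts() -> MultiIndex, every entry is a tuple
frame_labels = list(frame.value_counts().index)

# Series.value_counts() -> plain Index, every entry is the label itself
series_labels = list(frame["MSZoning"].value_counts().index)

print(frame_labels)
print(series_labels)
```

So whether `j[0]` or `j` is the right thing to use depends on which of the two calls produced the index, not on how many columns the user passed.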

Our goal is to make a replica of [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

# 4 | Functionalities
* Parameters 
* * ✅`drop : {‘first’, ‘if_binary’} or an array-like of shape (n_features,), default=None` - Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
* * * ✅`None` : retain all features (the default).
* * * ✅`first` : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
* * * ✅`if_binary` : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact
* ✅`dtype : number type, default=float` - Desired dtype of output.
* ✅`min_frequency : int or float, default=None` - Specifies the minimum frequency below which a category will be considered infrequent.
* * ✅If `int`, categories with a smaller cardinality will be considered infrequent.
* * ✅If `float`, categories with a smaller cardinality than `min_frequency * n_samples` will be considered infrequent.
* ✅`max_categories : int, default=None` - Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, `max_categories` includes the category representing the infrequent categories along with the frequent categories. If `None`, there is no limit to the number of output features.

# 4.1 | Drop

Let's just take our old example

In [25]:
sample_data = data["MSZoning"]
sample_data = pd.DataFrame(sample_data)
for i in sample_data.value_counts().index:
    sample_data["MSZoning" + "_" + i[0]] = np.where(sample_data["MSZoning"] == i[0] , 1 , 0)
sample_data

Unnamed: 0,MSZoning,MSZoning_RL,MSZoning_RM,MSZoning_FV,MSZoning_RH,MSZoning_C (all)
0,RL,1,0,0,0,0
1,RL,1,0,0,0,0
2,RL,1,0,0,0,0
3,RL,1,0,0,0,0
4,RL,1,0,0,0,0
...,...,...,...,...,...,...
1455,RL,1,0,0,0,0
1456,RL,1,0,0,0,0
1457,RL,1,0,0,0,0
1458,RL,1,0,0,0,0


Now we just need to drop the first column. We know that the name of the first column will be `MSZoning_RL`. Let's assume we are doing this for only one column this time. Then `MSZoning` and `_` are constant, and the only value we have to look up is `RL`. We can take it from `sample_data["MSZoning"].value_counts().index[0]`.

In [26]:
sample_data.drop("MSZoning" + "_" + str(sample_data["MSZoning"].value_counts().index[0]) , axis = 1 , inplace = True)
sample_data

Unnamed: 0,MSZoning,MSZoning_RM,MSZoning_FV,MSZoning_RH,MSZoning_C (all)
0,RL,0,0,0,0
1,RL,0,0,0,0
2,RL,0,0,0,0
3,RL,0,0,0,0
4,RL,0,0,0,0
...,...,...,...,...,...
1455,RL,0,0,0,0
1456,RL,0,0,0,0
1457,RL,0,0,0,0
1458,RL,0,0,0,0


We can add this functionality to our function. To do so, we need an `if` condition and a hyperparameter

In [27]:
def sample_func(dataframe , columns , drop = None):
    
    # a single column name passed as a plain string is wrapped into a list
    if isinstance(columns , str):
        
        columns = [columns]

    for i in columns:
        
        # categories ordered from most to least frequent
        categories = dataframe[i].value_counts().index
                
        for j in categories:
                
            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 1 , 0)

        if drop == "first" :
            
            # the first (most frequent) category loses its column
            dataframe.drop(i + "_" + str(categories[0]) , 
                           axis = 1 , inplace = True)
            
        dataframe.drop(i , axis = 1 , inplace = True)

Now for `drop = "if_binary"`.

Let's take an overview of what the function should do.
* if we choose `drop = None` , then it retains all columns
* if we choose `drop = "first"` , then it removes the first feature
* if we choose `drop = "if_binary"` , then it removes the first feature, but only when the column has exactly two categories

Notice that both `drop = "first"` and `drop = "if_binary"` remove the same first feature, only the condition differs. So we can combine them into a single `if`.

In [28]:
def sample_func(dataframe , columns , drop = None):
    
    # a single column name passed as a plain string is wrapped into a list
    if isinstance(columns , str):
        
        columns = [columns]

    for i in columns:
        
        # categories ordered from most to least frequent
        categories = dataframe[i].value_counts().index
                
        for j in categories:
                
            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 1 , 0)

        # "if_binary" only drops when the column has exactly two categories
        if drop == "first" or (drop == "if_binary" and len(categories) == 2) :
            
            dataframe.drop(i + "_" + str(categories[0]) , 
                           axis = 1 , inplace = True)
            
        dataframe.drop(i , axis = 1 , inplace = True)
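
According to the parameter description in section 4, `if_binary` only drops the first category for features with exactly two categories. A minimal sketch of just that decision, where the helper name `should_drop_first` and the two small columns are made up for illustration:

```python
import pandas as pd

def should_drop_first(column, drop=None):
    # Hypothetical helper: decide whether the first dummy column is dropped
    n_categories = column.nunique()
    if drop == "first":
        return True
    if drop == "if_binary":
        return n_categories == 2
    return False

two_values = pd.Series(["Pave", "Grvl", "Pave"])      # two categories
three_values = pd.Series(["RL", "RM", "FV", "RL"])    # three categories

print(should_drop_first(two_values, drop="if_binary"))
print(should_drop_first(three_values, drop="if_binary"))
```

With `drop = "first"` the helper says yes for both columns, with `drop = "if_binary"` only for the two-category one.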

# 4.2 | Dtype

The dtype of the output can, in the simplest case, only be either `int` or `float`. Since both are callable, we do not even need an `if-else` condition: we just pass `dtype(1)` and `dtype(0)` to `np.where`.

In [29]:
def sample_func(dataframe , columns , drop = None , dtype = float):

    # a single column name passed as a plain string is wrapped into a list
    if isinstance(columns , str):

        columns = [columns]

    for i in columns:

        # categories ordered from most to least frequent
        categories = dataframe[i].value_counts().index

        for j in categories:

            # dtype(1) and dtype(0) give the new column the requested dtype
            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 
                                                   dtype(1) , dtype(0))

        if drop == "first" or (drop == "if_binary" and len(categories) == 2) :

            dataframe.drop(i + "_" + str(categories[0]) , 
                           axis = 1 , inplace = True)

        dataframe.drop(i , axis = 1 , inplace = True)

        

# 4.3 | Min_Frequency
For this we will

* create a list of the categories that do not reach the minimum number of occurrences
* then check, for every category, whether it is absent from that list
* * if True, make the respective column
* * else, make another column called `other` that flags every row whose category is in the list. For this we will use the `pd.Series.isin(list)` function
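
The steps above can be sketched on a tiny made-up column before putting them into the function. The frame, the column name `zone` and the value of `min_frequency` are all assumptions for this example.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for one categorical column
frame = pd.DataFrame({"zone": ["RL", "RL", "RL", "RM", "RM", "FV", "RH"]})
min_frequency = 2

counts = frame["zone"].value_counts()

# Step 1: categories seen fewer than min_frequency times
infrequent = [c for c in counts.index if counts[c] < min_frequency]

# Step 2: frequent categories get their own column
for c in counts.index:
    if c not in infrequent:
        frame["zone_" + c] = np.where(frame["zone"] == c, 1, 0)

# Step 3: all infrequent categories share one "other" column
frame["zone_other"] = np.where(frame["zone"].isin(infrequent), 1, 0)

print(frame)
```

Here `FV` and `RH` each appear once, so they end up together in `zone_other`, while `RL` and `RM` keep their own columns.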

In [30]:
def sample_func(dataframe , columns , drop = None , min_frequency = None , dtype = float):

    # a single column name passed as a plain string is wrapped into a list
    if isinstance(columns , str):

        columns = [columns]

    for i in columns:

        counts = dataframe[i].value_counts()

        # inf : categories that occur fewer than min_frequency times
        if min_frequency is None:

            inf = []

        else :

            inf = [j for j in counts.index if counts[j] < min_frequency]

        frequent = [j for j in counts.index if j not in inf]

        for j in frequent:

            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 
                                                   dtype(1) , dtype(0))

        # all infrequent categories share one "other" column
        if len(inf) > 0:

            dataframe[i + "_other"] = np.where(dataframe[i].isin(inf) , 
                                               dtype(1) , dtype(0))

        if len(frequent) > 0 and (drop == "first" or (drop == "if_binary" and len(counts) == 2)) :

            dataframe.drop(i + "_" + str(frequent[0]) , 
                           axis = 1 , inplace = True)

        dataframe.drop(i , axis = 1 , inplace = True)

        

This was for the case where `min_frequency` is an `int`. For a `float`, we need to multiply this number by `n_samples`, the number of rows in the dataframe
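
As a worked example of that multiplication, here is a small sketch that reuses the `MSZoning` counts printed earlier. The value `min_frequency = 0.05` is just an assumption for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output shown earlier
counts = pd.Series({"RL": 1151, "RM": 218, "FV": 65, "RH": 16, "C (all)": 10})
n_samples = int(counts.sum())      # total number of rows

min_frequency = 0.05               # assumed value, a fraction of all rows
threshold = min_frequency * n_samples

# Categories below the threshold count as infrequent
infrequent = counts[counts < threshold].index.tolist()
print(threshold, infrequent)
```

With 1460 rows, 5% corresponds to roughly 73 occurrences, so `FV`, `RH` and `C (all)` would fall into the infrequent group.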

In [31]:
def sample_func(dataframe , columns , drop = None , min_frequency = None , dtype = float):

    # a single column name passed as a plain string is wrapped into a list
    if isinstance(columns , str):

        columns = [columns]

    # a float min_frequency is a fraction of n_samples , the number of rows
    if isinstance(min_frequency , float):

        min_frequency = min_frequency * len(dataframe)

    for i in columns:

        counts = dataframe[i].value_counts()

        # inf : categories that occur fewer than min_frequency times
        if min_frequency is None:

            inf = []

        else :

            inf = [j for j in counts.index if counts[j] < min_frequency]

        frequent = [j for j in counts.index if j not in inf]

        for j in frequent:

            dataframe[i + "_" + str(j)] = np.where(dataframe[i] == j , 
                                                   dtype(1) , dtype(0))

        # all infrequent categories share one "other" column
        if len(inf) > 0:

            dataframe[i + "_other"] = np.where(dataframe[i].isin(inf) , 
                                               dtype(1) , dtype(0))

        if len(frequent) > 0 and (drop == "first" or (drop == "if_binary" and len(counts) == 2)) :

            dataframe.drop(i + "_" + str(frequent[0]) , 
                           axis = 1 , inplace = True)

        dataframe.drop(i , axis = 1 , inplace = True)

        

# 4.4 | Max_categories

For this we will take the original list of categories, which is sorted by frequency, cut it at the hyperparameter (`max_categories`), and apply the same implementation as we did for `min_frequency`
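
A minimal sketch of that cut, again on the `MSZoning` counts printed earlier. It follows the parameter description in section 4, where `max_categories` includes the column for the infrequent categories, so one slot is kept for `other`. The value `max_categories = 3` is an assumption for this example.

```python
import pandas as pd

# Counts copied from the value_counts() output shown earlier
counts = pd.Series({"RL": 1151, "RM": 218, "FV": 65, "RH": 16, "C (all)": 10})
max_categories = 3                   # assumed value

categories = counts.index.tolist()   # already ordered by frequency

# One of the max_categories output columns is reserved for "other"
frequent = categories[: max_categories - 1]
infrequent = categories[max_categories - 1 :]

print(frequent, infrequent)
```

This gives two dedicated columns plus one `other` column, three output features in total.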

In [32]:
def sample_func(dataframe , columns , min_frequency = None , max_categories = None , dtype = float):
    if type(min_frquency) == int:
        pass
    else : 
        min_frequnecy *= len(columns)

    if not min_frequency == None:
    
        if len(columns) == 1 :
            
            inf = [j 
                for j in dataframe[columns[0]].value_counts().index 
                if dataframe[columns[0]].value_counts()[j] > min_frequency]
        
            for j in dataframe[columns[0]].value_counts().index:
        
                if not j in inf:
        
                    dataframe[columns[0] + "_" + j[0]] = np.where(dataframe[columns[0]] == j[0] , dtype(1) , dtype(0))
        
                else: 
        
                    dataframe[columns[0] + "_other"] = np.where(dataframe[columns[0]].isin(inf) , dtype(1) , dtype(0))
                if drop == "first" or drop == "if_binary" :
                        dataframe.drop(str(columne[0]) + "_" + sample_data[columne[0]].value_counts.index[0] , 
                                       axis = 1 , inplace = True)
                dataframe.drop(columne[0] , axis = 1 , inplace = True)
        else :
            for i in columns:
                
                inf = [j 
                    for j in dataframe[i].value_counts().index 
                    if dataframe[i].value_counts()[j] > min_frequency]
            
                for j in dataframe[i].value_counts().index:
            
                    if not j in inf:
            
                        dataframe[i + "_" + j] = np.where(dataframe[i] == j , dtype(1) , dtype(0))
            
                    else: 
            
                        dataframe[i + "_other"] = np.where(dataframe[i].isin(inf) , dtype(1) , dtype(0))
                    if drop == "first" or drop == "if_binary" :
                        dataframe.drop(str(i) + "_" + sample_data[i].value_counts.index[0] , 
                                       axis = 1 , inplace = True)
                dataframe.drop(i , axis = 1 , inplace = True) 

    elif not max_categories == None:
        
        if len(columns) == 1:
    
            inf = dataframe[columnes[0]].value_counts().index[max_categories : ]
    
            for j in dataframe[columnes[0]].value_counts().index[: max_categories]:
    
                dataframe[columnes[0] + "_" + j[0]] = np.where(dataframe[columnes[0]] == j[0] , dtype(1) , dtype(0))
    
            dataframe[columnes[0] + "_other"] = np.where(dataframe[columnes[0]].isin(inf) , dtype(1) , dtype(0))
            if drop == "first" or drop == "if_binary" :
                        dataframe.drop(str(columne[0]) + "_" + sample_data[columne[0]].value_counts.index[0] , 
                                       axis = 1 , inplace = True)
            dataframe.drop(columnes[0] , axis = 1 , inplace = True)

        else :

            for i in columns:
        
                inf = dataframe[i].value_counts().index[max_categories : ]
        
                for j in dataframe[i].value_counts().index[: max_categories]:
        
                    dataframe[i + "_" + j] = np.where(dataframe[i] == j , dtype(1) , dtype(0))
        
                dataframe[i + "_other"] = np.where(dataframe[i].isin(inf) , dtype(1) , dtype(0))
                if drop == "first" or drop == "if_binary" :
                        dataframe.drop(str(i) + "_" + sample_data[i].value_counts.index[0] , 
                                       axis = 1 , inplace = True)
                dataframe.drop(i , axis = 1 , inplace = True)

    else :    
        
        if len(columns) == 1:
                
            for j in dataframe[columns[0]].value_counts().index[0]:
                
                dataframe[columns[0] + "_" + j[0]] = np.where(dataframe[columns[0]] == j[0] , dtype(1) , dtype(0))
            if drop == "first" or drop == "if_binary" :
                        dataframe.drop(str(columns[0]) + "_" + sample_data[columns[0]].value_counts.index[0] , 
                                       axis = 1 , inplace = True)
            dataframe.drop(columns[0] , axis = 1 , inplace = True)

        else : 
        
            for i in columns:
                    
                # Loop over every category (not over the characters of the first one)
                for j in dataframe[i].value_counts().index:
                    
                    dataframe[i + "_" + j] = np.where(dataframe[i] == j , dtype(1) , dtype(0))
                if drop == "first" or drop == "if_binary" :
                        dataframe.drop(str(i) + "_" + dataframe[i].value_counts().index[0] , 
                                       axis = 1 , inplace = True)    
                dataframe.drop(i , axis = 1 , inplace = True)
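
The function above handles `drop == "first"` and `drop == "if_binary"` in exactly the same way. In scikit-learn's `OneHotEncoder`, `"if_binary"` only drops a column for features that have exactly two categories. Here is a small sketch of how that distinction could look. The helper name `category_to_drop` is made up for this illustration, and it follows this notebook's convention of treating the most frequent category as the "first" one.

```python
import pandas as pd

def category_to_drop(series, drop=None):
    """Hypothetical helper: return the category whose indicator column
    should be removed, or None if nothing should be dropped."""
    # value_counts() orders the categories from most to least frequent
    categories = series.value_counts().index
    if drop == "first":
        return categories[0]
    if drop == "if_binary" and len(categories) == 2:
        return categories[0]
    return None

binary = pd.Series(["yes", "no", "yes"])
multi = pd.Series(["Python", "Java", "C++", "Python"])

print(category_to_drop(binary, "if_binary"))  # yes
print(category_to_drop(multi, "if_binary"))   # None
print(category_to_drop(multi, "first"))       # Python
```

With a check like this, a three-category column keeps all of its indicator columns under `"if_binary"`, while a yes/no column is reduced to a single indicator.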
        

# 5 | Methods
* ✅`fit_transform(dataframe , columns)` - Fit to data, then transform it. Fits the transformer to `dataframe` and returns the transformed dataframe.
  * ✅`dataframe : array-like of shape (n_samples, n_features)` - Input samples.
  * ✅`columns : list of column names` - The categorical columns of `dataframe` to encode.
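
To make the "fit, then transform" contract concrete, here is a minimal sketch of what such a call is expected to produce on the toy data from the introduction. The helper `one_hot_sketch` is a hypothetical stand-in used only for this illustration, not the encoder built in this notebook, and unlike the notebook's code it works on a copy instead of changing the input in place.

```python
import numpy as np
import pandas as pd

# Toy data taken from the introduction of this notebook
students = pd.DataFrame({
    "CGPA": [6.9, 4.5, 8.8, 9.9],
    "Languages Known": ["Python", "Java", "C++", "Python"],
    "Placement In Ohio": [1, 0, 1, 1],
})

def one_hot_sketch(dataframe, columns, dtype=float):
    """Hypothetical helper: one indicator column per category,
    with the original categorical column removed."""
    encoded = dataframe.copy()  # leave the caller's dataframe untouched
    for feature in columns:
        for category in encoded[feature].value_counts().index:
            encoded[feature + "_" + str(category)] = np.where(
                encoded[feature] == category, dtype(1), dtype(0)
            )
        encoded = encoded.drop(feature, axis=1)
    return encoded

encoded = one_hot_sketch(students, ["Languages Known"])
print(encoded)
```

Each row ends up with a `1.0` in exactly one of the `Languages Known_*` columns, and the original string column is gone.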

# 5.2 | Fit_Transform

Since the keyword arguments are no longer passed to the function itself (they are stored on the object in `__init__`), we just need to make a class and put all of the code above into a `fit_transform` method.

In [33]:
import numpy as np

class OneHotEncoder:

    def __init__(self , min_frequency = None , max_categories = None , dtype = float , drop = None):
        self.min_frequency = min_frequency
        self.max_categories = max_categories
        self.dtype = dtype
        self.drop = drop

    def fit_transform(self , dataframe , columns):

        dtype = self.dtype
        drop = self.drop
        max_categories = self.max_categories
        min_frequency = self.min_frequency

        # A float min_frequency is a fraction of the rows, so convert it to a count
        if isinstance(min_frequency , float):
            min_frequency = min_frequency * len(dataframe)

        if min_frequency is not None:

            if len(columns) == 1 :

                counts = dataframe[columns[0]].value_counts()

                # Categories seen fewer than min_frequency times are infrequent
                inf = [categories 
                    for categories in counts.index 
                    if counts[categories] < min_frequency]

                for categories in counts.index:
                    if not categories in inf:
                        dataframe[columns[0] + "_" + str(categories)] = np.where(dataframe[columns[0]] == categories , dtype(1) , dtype(0))

                # All infrequent categories share a single "_other" column
                if len(inf) > 0:
                    dataframe[columns[0] + "_other"] = np.where(dataframe[columns[0]].isin(inf) , dtype(1) , dtype(0))

                # "if_binary" only drops a column when the feature has exactly two categories
                if drop == "first" or (drop == "if_binary" and len(counts) == 2) :
                    dataframe.drop(str(columns[0]) + "_" + str(counts.index[0]) , 
                                axis = 1 , inplace = True , errors = "ignore")

                dataframe.drop(columns[0] , axis = 1 , inplace = True)

            else :

                for feature in columns:

                    counts = dataframe[feature].value_counts()

                    inf = [categories 
                        for categories in counts.index 
                        if counts[categories] < min_frequency]

                    for categories in counts.index:
                        if not categories in inf:
                            dataframe[feature + "_" + str(categories)] = np.where(dataframe[feature] == categories , dtype(1) , dtype(0))

                    if len(inf) > 0:
                        dataframe[feature + "_other"] = np.where(dataframe[feature].isin(inf) , dtype(1) , dtype(0))

                    if drop == "first" or (drop == "if_binary" and len(counts) == 2) :
                        dataframe.drop(str(feature) + "_" + str(counts.index[0]) , 
                                    axis = 1 , inplace = True , errors = "ignore")

                    dataframe.drop(feature , axis = 1 , inplace = True) 

        elif max_categories is not None:

            if len(columns) == 1:

                counts = dataframe[columns[0]].value_counts()

                # Everything after the max_categories most frequent categories is grouped
                inf = counts.index[max_categories : ]

                for categories in counts.index[: max_categories]:
                    dataframe[columns[0] + "_" + str(categories)] = np.where(dataframe[columns[0]] == categories , dtype(1) , dtype(0))

                dataframe[columns[0] + "_other"] = np.where(dataframe[columns[0]].isin(inf) , dtype(1) , dtype(0))

                if drop == "first" or (drop == "if_binary" and len(counts) == 2) :
                    dataframe.drop(str(columns[0]) + "_" + str(counts.index[0]) , 
                                axis = 1 , inplace = True , errors = "ignore")

                dataframe.drop(columns[0] , axis = 1 , inplace = True)

            else :

                for feature in columns:

                    counts = dataframe[feature].value_counts()
                    inf = counts.index[max_categories : ]

                    for categories in counts.index[: max_categories]:
                        dataframe[feature + "_" + str(categories)] = np.where(dataframe[feature] == categories , dtype(1) , dtype(0))

                    dataframe[feature + "_other"] = np.where(dataframe[feature].isin(inf) , dtype(1) , dtype(0))

                    if drop == "first" or (drop == "if_binary" and len(counts) == 2) :
                        dataframe.drop(str(feature) + "_" + str(counts.index[0]) , 
                                    axis = 1 , inplace = True , errors = "ignore")

                    dataframe.drop(feature , axis = 1 , inplace = True)

        else :    

            if len(columns) == 1:

                counts = dataframe[columns[0]].value_counts()

                # Loop over every category, one indicator column each
                for categories in counts.index:
                    dataframe[columns[0] + "_" + str(categories)] = np.where(dataframe[columns[0]] == categories , dtype(1) , dtype(0))

                if drop == "first" or (drop == "if_binary" and len(counts) == 2) :
                    dataframe.drop(str(columns[0]) + "_" + str(counts.index[0]) , 
                                axis = 1 , inplace = True , errors = "ignore")

                dataframe.drop(columns[0] , axis = 1 , inplace = True)

            else : 

                for feature in columns:

                    counts = dataframe[feature].value_counts()

                    for categories in counts.index:
                        dataframe[feature + "_" + str(categories)] = np.where(dataframe[feature] == categories , dtype(1) , dtype(0))

                    if drop == "first" or (drop == "if_binary" and len(counts) == 2) :
                        dataframe.drop(str(feature) + "_" + str(counts.index[0]) , 
                                    axis = 1 , inplace = True , errors = "ignore")    

                    dataframe.drop(feature , axis = 1 , inplace = True)

        # The dataframe is changed in place and also returned for convenience
        return dataframe

Now we will just make some small tweaks to this and give everything clearer names.

# 6 | Final One Hot Encoding Source Code

In [34]:
import numpy as np

class OneHotEncoder:

    def __init__(self , min_frequency = None , max_categories = None , dtype = float , drop = None):
        self.min_frequency = min_frequency
        self.max_categories = max_categories
        self.dtype = dtype
        self.drop = drop

    def fit_transform(self , dataframe , columns):

        min_frequency = self.min_frequency

        # A float min_frequency is a fraction of the rows, so convert it to a count
        if isinstance(min_frequency , float):
            min_frequency = min_frequency * len(dataframe)

        # One loop covers both the single-column and the multi-column case
        for feature in columns:

            # value_counts() sorts the categories, most frequent first
            counts = dataframe[feature].value_counts()

            if min_frequency is not None:
                # Categories seen fewer than min_frequency times are infrequent
                infrequent = [category for category in counts.index if counts[category] < min_frequency]
                add_other = len(infrequent) > 0

            elif self.max_categories is not None:
                # Everything after the max_categories most frequent categories is grouped
                infrequent = list(counts.index[self.max_categories : ])
                add_other = True

            else :
                infrequent = []
                add_other = False

            # One indicator column for every category that is not grouped
            for category in counts.index:
                if category not in infrequent:
                    dataframe[feature + "_" + str(category)] = np.where(dataframe[feature] == category , self.dtype(1) , self.dtype(0))

            # All infrequent categories share a single "_other" column
            if add_other:
                dataframe[feature + "_other"] = np.where(dataframe[feature].isin(infrequent) , self.dtype(1) , self.dtype(0))

            # "if_binary" only drops a column when the feature has exactly two categories
            if self.drop == "first" or (self.drop == "if_binary" and len(counts) == 2) :
                dataframe.drop(feature + "_" + str(counts.index[0]) , 
                            axis = 1 , inplace = True , errors = "ignore")

            # The original categorical column is no longer needed
            dataframe.drop(feature , axis = 1 , inplace = True)

        # The dataframe is changed in place and also returned for convenience
        return dataframe
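
The grouping rule behind `min_frequency` and `max_categories` is easy to lose in the long method above, so here is a rough standalone sketch of just that decision. It assumes scikit-learn's meaning of the options (an `int` is a count, a `float` is a fraction of the number of rows), and the helper name `infrequent_categories` is made up for this illustration.

```python
import pandas as pd

def infrequent_categories(series, min_frequency=None, max_categories=None):
    """Hypothetical helper: list the categories that would be merged
    into the single "_other" column."""
    counts = series.value_counts()  # most frequent category first
    if min_frequency is not None:
        # A float is read as a fraction of the rows, an int as a plain count
        if isinstance(min_frequency, float):
            threshold = min_frequency * len(series)
        else:
            threshold = min_frequency
        return [c for c in counts.index if counts[c] < threshold]
    if max_categories is not None:
        # Keep the top max_categories categories, group the rest
        return list(counts.index[max_categories:])
    return []

languages = pd.Series(["Python"] * 5 + ["Java"] * 3 + ["C++"] + ["Go"])

print(sorted(infrequent_categories(languages, min_frequency=2)))    # ['C++', 'Go']
print(sorted(infrequent_categories(languages, min_frequency=0.4)))  # ['C++', 'Go', 'Java']
print(sorted(infrequent_categories(languages, max_categories=2)))   # ['C++', 'Go']
```

The results are sorted before printing because categories with equal counts may come back from `value_counts()` in either order.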

**THAT'S IT FOR TODAY, GUYS**

**HOPE YOU UNDERSTOOD AND LIKED MY WORK**

**DON'T FORGET TO UPVOTE :)**

**PEACE OUT !!!**