# Categorical Encoding 
Categorical encoding is a technique used in machine learning to represent categorical data in a numerical form that can be used as input to machine learning models. Categorical data is non-numeric data, for example the name of a country or the types of products sold in a store.

There are several categorical encoding techniques, such as one-hot encoding, ordinal encoding, and target encoding. One-hot encoding creates a binary vector for each category, with each element indicating whether the category is present in the corresponding data point. Ordinal encoding assigns a numerical value to each category based on its rank or order. Target encoding calculates the average target value for each category and replaces the category with that value.
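To make the three techniques concrete, here is a minimal sketch on a made-up toy table. The column names `size` and `price`, the values, and the chosen category order are all assumptions for illustration and are not part of the housing dataset used later.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a numeric target.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small", "large"],
    "price": [10.0, 30.0, 20.0, 12.0, 34.0],
})

# One-hot: one 0/1 column per category.
categories = sorted(df["size"].unique())
one_hot = pd.DataFrame(
    {f"size_{c}": (df["size"] == c).astype(int) for c in categories}
)

# Ordinal: map each category to its rank in an explicit (assumed) order.
order = {"small": 0, "medium": 1, "large": 2}
ordinal = df["size"].map(order)

# Target: replace each category with the mean target value of that category.
target = df["size"].map(df.groupby("size")["price"].mean())

print(one_hot)
print(ordinal.tolist())
print(target.tolist())
```

Note that the target encoding here uses the same rows it is computed from; in practice it is usually computed on training data only.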

Categorical encoding is important because many machine learning algorithms require numerical data as input. Transforming categorical data into numerical data makes it easier to process and analyze, and machine learning models can then be trained to make predictions based on this data.

**Modules Used**
* Numpy 
* Pandas 

**Note**
* Pandas is only used to access the dataset at various levels
* The dataset used is [this](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data)

In [63]:
import pandas as pd 
import numpy as np 

In [64]:
data = pd.read_csv("/content/data.csv")

In [65]:
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


`MSZoning` seems to be a good column for testing our code, so we will focus on it first.

In [66]:
sample_data = data["MSZoning"]

# One Hot Encoding 

One Hot Encoding is a popular technique for categorical encoding in machine learning. It converts a categorical variable to a binary format that can be used as input for a machine learning model.

This technique creates a separate binary feature for each category of a categorical variable. Each binary feature represents whether that category is present in a given data point. For example, if you have three categories, A, B, and C, three binary features will be created, one for each category. If a data point belongs to category A, the binary feature for category A has the value 1 and the binary features for categories B and C have the value 0.

One Hot Encoding is especially useful for nominal categorical variables that have no category-specific ordering. This technique ensures that machine learning models do not interpret the numbers associated with categories as having any inherent meaning or order.

The drawback of One Hot Encoding is that datasets with many categories have a large number of features, which can lead to computational resource problems and overfitting. Therefore, it is important to carefully consider using One Hot Encoding and use other techniques such as feature selection and dimensionality reduction if necessary. 
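As a rough way to see this drawback before encoding anything, one can count the distinct values per column, since each distinct value becomes one new feature. The sketch below uses a made-up table; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: two low-cardinality columns and one high-cardinality column.
df = pd.DataFrame({
    "zone": ["RL", "RM", "RL", "FV"],
    "street": ["Pave", "Pave", "Grvl", "Pave"],
    "house_id": ["h1", "h2", "h3", "h4"],  # unique per row
})

# Number of one-hot columns each original column would produce.
new_columns = df.nunique()
total_new_columns = int(new_columns.sum())

print(new_columns)
print(total_new_columns)
```

A column such as `house_id`, with one category per row, adds as many features as there are rows, which is the situation where grouping rare categories or choosing another encoding is worth considering.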

Let's first take a look at the dataset

In [67]:
sample_data

0       RL
1       RL
2       RL
3       RL
4       RL
        ..
1455    RL
1456    RL
1457    RL
1458    RL
1459    RL
Name: MSZoning, Length: 1460, dtype: object

We can get the unique values that appear in the dataset with the `pd.Series.unique()` function

In [68]:
print(sample_data.unique())
type(sample_data.unique())

['RL' 'RM' 'C (all)' 'FV' 'RH']


numpy.ndarray

This is a `numpy array` of objects, as you can see.

But there is a problem with the `unique` function: its result is not sorted, and it gives no way to order the values by their number of occurrences in the dataset. So we will instead use the `pd.Series.value_counts()` function to access both the values and their number of occurrences in the dataset.
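The difference is easy to see on a small made-up Series (the values below are an assumption for illustration, not the real column): `unique()` returns values in order of first appearance, while `value_counts()` returns them from most to least frequent together with their counts.

```python
import pandas as pd

# Hypothetical sample of zoning codes.
zones = pd.Series(["RM", "RL", "RL", "FV", "RL", "RM"])

first_seen = zones.unique().tolist()   # order of first appearance
counts = zones.value_counts()          # sorted by frequency, descending

print(first_seen)
print(counts.index.tolist())
print(counts.tolist())
```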

In [70]:
print(sample_data.value_counts())
type(sample_data.value_counts())

RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64


pandas.core.series.Series

You can access only the categories through the index, as `pd.Series.value_counts().index`, and the occurrences from the counts themselves, like this

In [None]:
# Compute the counts once; the index holds the categories, the values hold the counts
counts = sample_data.value_counts()
occurrences = counts.tolist()
categories = counts.index.tolist()

In [None]:
occurrences

[1151, 218, 65, 16, 10]

In [72]:
print(categories)
type(sample_data.value_counts().index)

['RL', 'RM', 'FV', 'RH', 'C (all)']


pandas.core.indexes.base.Index

We will use this list to make new columns in our dataset: we simply run a for loop over every value and use the `np.where` function to create a new column of binary digits specifying whether the particular category occurs at each position. But let's first try this for one category, and then apply the for loop.

In [73]:
sample_data = pd.DataFrame(sample_data)
sample_data

Unnamed: 0,MSZoning
0,RL
1,RL
2,RL
3,RL
4,RL
...,...
1455,RL
1456,RL
1457,RL
1458,RL


In [74]:
# Compare the column itself (a Series), so np.where returns a 1-D array
sample_data["RL"] = np.where(sample_data["MSZoning"] == "RL", 1, 0)

Now let's see our sample dataset

In [75]:
sample_data

Unnamed: 0,MSZoning,RL
0,RL,1
1,RL,1
2,RL,1
3,RL,1
4,RL,1
...,...,...
1455,RL,1
1456,RL,1
1457,RL,1
1458,RL,1


As we can see, most of our work is done; now we just need to apply the loop.

In [76]:
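# Count the categories of the MSZoning column only, so each label is a plain string
for category in sample_data["MSZoning"].value_counts().index:
    sample_data["MSZoning" + "_" + category] = np.where(sample_data["MSZoning"] == category, 1, 0)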

In [77]:
sample_data

Unnamed: 0,MSZoning,RL,MSZoning_RL,MSZoning_RM,MSZoning_FV,MSZoning_RH,MSZoning_C (all)
0,RL,1,1,0,0,0,0
1,RL,1,1,0,0,0,0
2,RL,1,1,0,0,0,0
3,RL,1,1,0,0,0,0
4,RL,1,1,0,0,0,0
...,...,...,...,...,...,...,...
1455,RL,1,1,0,0,0,0
1456,RL,1,1,0,0,0,0
1457,RL,1,1,0,0,0,0
1458,RL,1,1,0,0,0,0


Now we have created our one hot encoder from scratch. We will add some functionality:
* Iterating over a list of columns
* min_frequency
* max_categories

For **iterating over a list of columns**, we will just iterate over the given columns; for easier use we will put this into a function.

In [78]:
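def sample_func(dataframe, columns):
    for i in columns:
        # One new 0/1 column per category of column i
        # (iterate over the whole index, not only its first label)
        for j in dataframe[i].value_counts().index:
            dataframe[i + "_" + j] = np.where(dataframe[i] == j, 1, 0)

        # The original column is no longer needed
        dataframe.drop(i, axis=1, inplace=True)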

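This is vulnerable to one risk: the user passing a single column. Whether the category labels come back as tuples or as plain values depends on the object being counted. As far as I can tell, `DataFrame.value_counts()` yields tuples even for a one-column DataFrame, while `Series.value_counts()` yields plain labels, so the difference comes from pandas rather than NumPy. We will add an `if` condition to work around this. **If you know a cleaner way to handle this, please create an issue on GitHub.**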
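The tuple-versus-label behaviour can be reproduced on a tiny made-up frame. This is a sketch under the assumption that pandas behaves as in the versions I am aware of, where `DataFrame.value_counts()` always builds a MultiIndex; the data is hypothetical.

```python
import pandas as pd

# Hypothetical mini dataset with one categorical column.
df = pd.DataFrame({"MSZoning": ["RL", "RL", "RM"]})

# Selecting with a string gives a Series: labels are plain values.
series_labels = list(df["MSZoning"].value_counts().index)

# Selecting with a list gives a one-column DataFrame: labels are 1-tuples.
frame_labels = list(df[["MSZoning"]].value_counts().index)

print(series_labels)
print(frame_labels)
```

So `label[0]` is only needed when the counts were taken on a DataFrame; counting on `dataframe[column_name]` avoids the tuples altogether.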

In [None]:
def sample_func(dataframe, columns):
    # A single column name passed as a string is wrapped in a list,
    # so the loop below never iterates over its characters
    if isinstance(columns, str):
        columns = [columns]

    for i in columns:
        # dataframe[i] is a Series, so every label j is a plain value
        for j in dataframe[i].value_counts().index:
            dataframe[i + "_" + j] = np.where(dataframe[i] == j, 1, 0)

        dataframe.drop(i, axis=1, inplace=True)
    

Our second functionality will be **min_frequency** 

For this we will
* create a list containing the categories that do not reach the minimum number of occurrences
* check whether each category is absent from that list
  * **if True**, make the respective column
  * **else**, make a single `other` column that flags every category in the list; for this we will use the `pd.Series.isin(list)` function
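The steps above can be sketched on a small made-up Series before wrapping them in a function. The data, the `zone_` prefix, and the threshold are assumptions for illustration; the sketch treats a category as infrequent when it occurs fewer than `min_frequency` times.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: RL and RM are common, FV and RH are rare.
zones = pd.Series(["RL", "RL", "RL", "RM", "RM", "FV", "RH"])
min_frequency = 2  # hypothetical threshold

counts = zones.value_counts()
infrequent = [cat for cat in counts.index if counts[cat] < min_frequency]

encoded = pd.DataFrame(index=zones.index)
for cat in counts.index:
    if cat not in infrequent:
        # Frequent category: its own 0/1 column
        encoded["zone_" + cat] = np.where(zones == cat, 1, 0)

if infrequent:
    # All infrequent categories share one column
    encoded["zone_other"] = np.where(zones.isin(infrequent), 1, 0)

print(encoded)
```

Every row still has exactly one 1, so no information about membership is lost except which rare category it was.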

In [None]:
def sample_func(dataframe, columns, min_frequency=None):
    # Accept a single column name as well as a list of names
    if isinstance(columns, str):
        columns = [columns]

    for i in columns:
        counts = dataframe[i].value_counts()

        # Categories that do not reach min_frequency (none if it is not set)
        if min_frequency is not None:
            inf = [j for j in counts.index if counts[j] < min_frequency]
        else:
            inf = []

        # Frequent categories each get their own 0/1 column
        for j in counts.index:
            if j not in inf:
                dataframe[i + "_" + j] = np.where(dataframe[i] == j, 1, 0)

        # All infrequent categories share a single "other" column
        if inf:
            dataframe[i + "_other"] = np.where(dataframe[i].isin(inf), 1, 0)

        dataframe.drop(i, axis=1, inplace=True)
        

Now we will add the functionality of **max_categories**. For this we take the original list of categories, ordered by frequency, split it at the hyperparameter (`max_categories`), and apply the same implementation as we did for `min_frequency`.
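The split itself is just two slices of the frequency-ordered index. A minimal sketch on made-up data (the values and the limit are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with distinct counts: RL x4, RM x3, FV x2, RH x1.
zones = pd.Series(["RL"] * 4 + ["RM"] * 3 + ["FV"] * 2 + ["RH"])
max_categories = 2  # hypothetical limit

ordered = zones.value_counts().index       # most frequent first
kept = list(ordered[:max_categories])      # get their own columns
grouped = list(ordered[max_categories:])   # go into the "other" column

print(kept)
print(grouped)
```

With this convention the `other` column comes on top of the `max_categories` kept columns, so the encoded output has up to `max_categories + 1` columns per original column.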

In [None]:
def sample_func(dataframe, columns, min_frequency=None, max_categories=None):
    # Accept a single column name as well as a list of names
    if isinstance(columns, str):
        columns = [columns]

    for i in columns:
        counts = dataframe[i].value_counts()

        # Decide which categories are grouped; min_frequency takes precedence
        if min_frequency is not None:
            inf = [j for j in counts.index if counts[j] < min_frequency]
        elif max_categories is not None:
            # Everything after the max_categories most frequent categories
            inf = list(counts.index[max_categories:])
        else:
            inf = []

        # Remaining categories each get their own 0/1 column
        for j in counts.index:
            if j not in inf:
                dataframe[i + "_" + j] = np.where(dataframe[i] == j, 1, 0)

        # All grouped categories share a single "other" column
        if inf:
            dataframe[i + "_other"] = np.where(dataframe[i].isin(inf), 1, 0)

        dataframe.drop(i, axis=1, inplace=True)
        

Now we will just make some small tweaks to this and give it a proper name.

In [79]:
def one_hot_encoder(dataframe, columns, min_frequency=None, max_categories=None):
    """One-hot encode `columns` of `dataframe` in place.

    Categories rarer than `min_frequency`, or beyond the `max_categories`
    most frequent ones, are grouped into a single `<column>_other` column.
    If both are given, `min_frequency` takes precedence.
    """
    # Accept a single column name as well as a list of names
    if isinstance(columns, str):
        columns = [columns]

    for i in columns:
        counts = dataframe[i].value_counts()

        if min_frequency is not None:
            inf = [j for j in counts.index if counts[j] < min_frequency]
        elif max_categories is not None:
            inf = list(counts.index[max_categories:])
        else:
            # Neither option set: plain one hot encoding
            inf = []

        # Remaining categories each get their own 0/1 column
        for j in counts.index:
            if j not in inf:
                dataframe[i + "_" + j] = np.where(dataframe[i] == j, 1, 0)

        # All grouped categories share a single "other" column
        if inf:
            dataframe[i + "_other"] = np.where(dataframe[i].isin(inf), 1, 0)

        dataframe.drop(i, axis=1, inplace=True)
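As a sanity check on the from-scratch approach, the plain `np.where` loop can be compared with pandas' built-in `pd.get_dummies`. This is only a sketch on a made-up mini dataset; `get_dummies` is not used anywhere else in this notebook and serves here purely as a reference.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})

# From-scratch encoding: one np.where column per category.
manual = pd.DataFrame({
    "MSZoning_" + cat: np.where(df["MSZoning"] == cat, 1, 0)
    for cat in df["MSZoning"].value_counts().index
})

# Reference encoding from pandas.
builtin = pd.get_dummies(df["MSZoning"], prefix="MSZoning").astype(int)

# Column order differs (by frequency vs. alphabetical), so align before comparing.
same = manual[builtin.columns].to_numpy().tolist() == builtin.to_numpy().tolist()
print(same)
```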
