The ideas in this notebook were merged into the Data Science/Functions project.

## Handling categorical data

Ideas:
* https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
list1 = ["paris", "paris", "amsterdam", "tokyo", "amsterdam"]
list2 = [1,1,2,6,2,1,6,2,2,6,6]

In [3]:
X = pd.DataFrame({'country':["paris", "paris", "amsterdam", "tokyo", "amsterdam"], 'count':[1,2,2,6,6]})
X

Unnamed: 0,country,count
0,paris,1
1,paris,2
2,amsterdam,2
3,tokyo,6
4,amsterdam,6


# 1. sklearn.preprocessing.LabelEncoder

* encodes labels as integers, based on the sorted unique classes seen during fit
* raises a ValueError when transform receives a label that was not seen during fit
* works only on one-dimensional arrays

In [4]:
from sklearn.preprocessing import LabelEncoder

In [5]:
print(list1, "\n")

le = LabelEncoder()
le.fit(list1)
print("transformed list: ", le.transform(list1))
print("classes: ", le.classes_)
print(le.transform(["paris", "amsterdam", "tokyo"]))

['paris', 'paris', 'amsterdam', 'tokyo', 'amsterdam'] 

transformed list:  [1 1 0 2 0]
classes:  ['amsterdam' 'paris' 'tokyo']
[1 0 2]


In [6]:
print(list2, "\n")

le = LabelEncoder()
le.fit(list2)
print("transformed list: ", le.transform(list2))
print("classes: ", le.classes_)

[1, 1, 2, 6, 2, 1, 6, 2, 2, 6, 6] 

transformed list:  [0 0 1 2 1 0 2 1 1 2 2]
classes:  [1 2 6]
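
The cells above do not exercise the unseen-label case from the bullet list. The following is a minimal pure-Python sketch of the same behaviour, not scikit-learn's implementation: `fit_classes` and `transform` are hypothetical helper names, and the error message is illustrative.

```python
def fit_classes(labels):
    """Return the sorted unique labels, like the classes_ output above."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map each label to its index in classes; fail on unseen labels."""
    index = {label: i for i, label in enumerate(classes)}
    unseen = [label for label in labels if label not in index]
    if unseen:
        raise ValueError(f"previously unseen labels: {unseen}")
    return [index[label] for label in labels]


classes = fit_classes(["paris", "paris", "amsterdam", "tokyo", "amsterdam"])
print(classes)                                 # ['amsterdam', 'paris', 'tokyo']
print(transform(classes, ["paris", "tokyo"]))  # [1, 2]

try:
    transform(classes, ["budapest"])           # "budapest" was never fitted
except ValueError as err:
    print("error:", err)
```

Failing loudly on unseen labels is safe for target labels, but it is the main reason the custom encoder later in this notebook maps unknown values to -1 instead.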


OrdinalEncoder applies the same encoding as LabelEncoder, but to 2D arrays, with each column encoded independently:

In [7]:
from sklearn.preprocessing import OrdinalEncoder

In [8]:
enc = OrdinalEncoder()
enc.fit(X)
enc.transform(X)

array([[1., 0.],
       [1., 1.],
       [0., 1.],
       [2., 2.],
       [0., 2.]])
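
To make the column-by-column behaviour explicit, here is a minimal pure-Python sketch that reproduces the array above. It is an illustration under the assumption that each column is encoded by the position of a value among that column's sorted unique values; `ordinal_encode` is a hypothetical name, not part of scikit-learn.

```python
def ordinal_encode(rows):
    """Encode every column of a row-major table independently."""
    encoded_columns = []
    for column in zip(*rows):  # transpose: iterate over columns
        # position of each value among the column's sorted unique values
        index = {value: float(i) for i, value in enumerate(sorted(set(column)))}
        encoded_columns.append([index[value] for value in column])
    return [list(row) for row in zip(*encoded_columns)]  # transpose back


rows = [("paris", 1), ("paris", 2), ("amsterdam", 2), ("tokyo", 6), ("amsterdam", 6)]
for encoded_row in ordinal_encode(rows):
    print(encoded_row)
```

Note that the count column is also treated as categorical here: the values 1, 2, 6 become 0.0, 1.0, 2.0.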

Converting text columns to dummy columns (col 0?)

In [9]:
import category_encoders as ce

In [10]:
encoder = ce.BaseNEncoder()
encoder.fit(X)
encoder.transform(X)

Unnamed: 0,country_0,country_1,country_2,count
0,0,0,1,1
1,0,0,1,2
2,0,1,0,2
3,0,1,1,6
4,0,1,0,6


In [11]:
X2 = pd.DataFrame({'country':["budapest", "paris", "amsterdam", "tokyo", "amsterdam", "tokyo"], 'count':[1,2,2,6,6,1]})
encoder.transform(X2)

Unnamed: 0,country_0,country_1,country_2,count
0,0,0,0,1
1,0,0,1,2
2,0,1,0,2
3,0,1,1,6
4,0,1,0,6
5,0,1,1,1
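
Reading the two outputs above, each city appears to receive an integer code in order of first appearance (paris 1, amsterdam 2, tokyo 3), written in base 2 across the country_* columns, while the unseen city budapest becomes all zeros. The sketch below reproduces those rows under that assumption. The function names and the fixed `width=3` are illustrative choices made to match the output; they are not the category_encoders API or its internals.

```python
def fit_codes(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    return codes


def to_digits(code, base=2, width=3):
    """Write code as a fixed-width list of base-N digits, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(code % base)
        code //= base
    return digits[::-1]


def encode(codes, values, base=2, width=3):
    """Unseen categories fall back to code 0, i.e. all-zero digits."""
    return [to_digits(codes.get(value, 0), base, width) for value in values]


codes = fit_codes(["paris", "paris", "amsterdam", "tokyo", "amsterdam"])
for city, row in zip(["budapest", "paris", "amsterdam", "tokyo"],
                     encode(codes, ["budapest", "paris", "amsterdam", "tokyo"])):
    print(city, row)
```

Compared with one dummy column per category, a base-N code needs only about log_N(number of categories) columns, at the cost of columns that are harder to interpret individually.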


# Our label encoder
* encodes a categorical column or list as an integer array
* unique_max = the number of most common unique elements to keep (default 10); ties at the cutoff are kept too, so more than unique_max elements may get a label
* an element that occurs fewer times than the unique_max-th most common element does not get a label of its own (it gets label -1)
* the kept elements are labeled in order of first appearance
* new instances that were not seen during fit get label -1
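
The rule in the list above can be sketched compactly with `collections.Counter`. This is a minimal illustration of the described rule, not the class used in this notebook; `fit_labels` and `encode` are hypothetical names, and the sample data is the list from one of the test cells further down.

```python
from collections import Counter


def fit_labels(values, unique_max=10):
    """Build an element -> label mapping; rare elements map to -1."""
    counts = Counter(values)
    # initial labels follow the order of first appearance
    labels = {value: i for i, value in enumerate(counts)}
    ranked = sorted(counts.values(), reverse=True)
    if len(ranked) > unique_max:
        # occurrence count of the unique_max-th most common element
        threshold = ranked[unique_max - 1]
        for value, count in counts.items():
            if count < threshold:
                labels[value] = -1
    return labels


def encode(labels, values):
    """Elements missing from the mapping (new instances) also get -1."""
    return [labels.get(value, -1) for value in values]


data = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 11, 12]
labels = fit_labels(data)
print(encode(labels, [1, 10, 11, 12, 13]))  # [0, 9, 10, -1, -1]
```

Here 12 occurs only once, below the count of the 10th most common element (2), and 13 was never seen, so both map to -1.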

In [12]:
class UniqueEncoder:
    def __init__(self, unique_max=10):
        self.dict_uniques = {}   # number of occurrences of each unique element
        self.dict_idforall = {}  # initial id of each unique element, in order of first appearance
        self.unique_max = unique_max

    def fit(self, initial_list):
        for elem in initial_list:
            self.dict_uniques[elem] = self.dict_uniques.get(elem, 0) + 1
            if elem not in self.dict_idforall:
                self.dict_idforall[elem] = len(self.dict_idforall)

        sortedvalues = sorted(self.dict_uniques.values(), reverse=True)
        print("sortedvalues: ", sortedvalues)
        if len(sortedvalues) > self.unique_max:  # more than unique_max uniques
            # occurrence count of the unique_max-th most common element (0-based index)
            threshold = sortedvalues[self.unique_max - 1]
            for key, value in self.dict_uniques.items():
                if value < threshold:
                    self.dict_idforall[key] = -1  # too rare: no label of its own

    def transform(self, initial_list):
        # elements that were not seen during fit also get -1
        return [self.dict_idforall.get(elem, -1) for elem in initial_list]

    def fit_transform(self, initial_list):
        self.fit(initial_list)
        return self.transform(initial_list)

In [13]:
enc = UniqueEncoder()
enc.fit(list1)
enc.transform(list1)

sortedvalues:  [2, 2, 1]


[0, 0, 1, 2, 1]

In [14]:
def my_encoder(initlist, convertlist):
    """fits the encoder on initlist and converts convertlist based on that"""
    enc = UniqueEncoder()
    enc.fit(initlist)
    result = enc.transform(convertlist)
    return result

In [15]:
my_encoder([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,11,12],  [1,10,11,12])

sortedvalues:  [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]


[0, 9, 10, -1]

In [16]:
my_encoder([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12],  [1,10,11,12])

sortedvalues:  [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


[0, 9, 10, 11]

In [17]:
my_encoder([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,12,12,12],  [1,10,11,12])

sortedvalues:  [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]


[0, 9, -1, 11]

In [18]:
my_encoder([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,12,12,12],  [1,10,11,12,13])

sortedvalues:  [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]


[0, 9, -1, 11, -1]

In [19]:
my_encoder([1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,11,12,13],  [1,10,11,12,13])

sortedvalues:  [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]


[0, 9, 10, -1, -1]