<hr>
<div align="center">
<font size="8">
  <b>String Vectorization</b> 
</font><br>
</div>
<hr>

# Prototype selection

Available methods for choosing prototypes:
- K-Means
- K-Medoids
- CLARA
- CLARANS
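
As an illustration of one of the methods listed above, here is a minimal sketch of K-Medoids. It uses the simple alternating scheme (assign points, then update medoids) rather than the classic PAM swap search. The function name `k_medoids`, its parameters, and the toy one-dimensional data are assumptions made for this example. It expects a precomputed pairwise distance matrix, which for strings could be filled with any string distance.

```python
import numpy as np


def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating K-Medoids on a precomputed (n, n) distance matrix.

    Hypothetical helper for illustration; returns (medoid indices, labels).
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    # Start from k distinct, randomly chosen points.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its closest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: the new medoid minimizes the total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels


# Toy data: two well-separated groups on a line.
points = np.array([0, 1, 2, 10, 11, 12])
dist = np.abs(points[:, None] - points[None, :])
medoids, labels = k_medoids(dist, k=2)
```

The selected medoids are actual data points, which is what makes this family of methods usable for prototype selection when only pairwise distances are available.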

Available distances between strings:
- Edit distance (on raw strings, no tokenization)
- Jaccard (requires tokenization)
- Euclid-Jaccard (requires tokenization)
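
The first two distances can be sketched as follows. This is a minimal illustration, not the implementation used in this notebook: the function names `edit_distance` and `jaccard_distance` are assumptions, and Euclid-Jaccard is left out because its definition is not given here. The edit distance works directly on the raw strings, while the Jaccard distance expects token sets such as those produced by a tokenizer.

```python
def edit_distance(str1, str2):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        curr = [i]
        for j, c2 in enumerate(str2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard_distance(tokens1, tokens2):
    """1 - |A ∩ B| / |A ∪ B| for two token sets."""
    union = tokens1 | tokens2
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(tokens1 & tokens2) / len(union)


d_edit = edit_distance("abcd", "bcda")
d_jaccard = jaccard_distance({"ab", "bc", "cd"}, {"bc", "cd", "da"})
```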

In [47]:
'''
    metric: a function with the following signature

        def metric(str1, str2):
            ...
            return distance
'''

import logging
from logging import warning, exception, error

import nltk
import numpy as np
from tqdm import tqdm

logging.basicConfig(filename='tokenization.log', level=logging.INFO)

# The messages below are passed print-style (several positional arguments).
# logging.info would treat the extra arguments as %-format arguments,
# so `info` is bound to print instead.
info = print

class Tokenizer:
     
    
    def __init__(
        self,
        metric = None, 
        qgrams = None, 
        is_char_tokenization = None, 
        clean = None
    ):
        
        self.metric = metric
        self.qgrams = qgrams
        self.is_char_tokenization = is_char_tokenization
        self.clean = clean
        
        info("Tokenization initialized with.. ")
        info("- Metric: ", self.metric)
        info("- Q-grams: ", self.qgrams)
        info("- Char-Tokenization: ", self.is_char_tokenization)
        info("- Text cleaning process: ", self.clean)
        
    def process(self, data):
        
        # TODO: branch on the input type (list, pd.DataFrame, np.ndarray)
            
        self.data_size = len(data)
        self.data = np.array(data, dtype = object)
        self.tokenized_data = np.empty([self.data_size], dtype = object)
        
        info("\nProcessing starts.. ")
        info("- Data size: ", self.data_size)
        
        for i in tqdm(range(self.data_size), desc="Processing.."):
            # Apply the optional cleaning function first
            if self.clean is not None:
                string = self.clean(self.data[i])
            else:
                string = self.data[i]

            if self.is_char_tokenization:
                # Character q-grams: slide a window of size q over the string
                tokens = string
                n = self.qgrams
            else:
                # Word q-grams: tokenize once; a string with at most q words
                # becomes a single n-gram containing all of its words
                tokens = nltk.word_tokenize(string)
                n = min(self.qgrams, len(tokens))

            self.tokenized_data[i] = set(nltk.ngrams(tokens, n = n))

        return self.tokenized_data
        
dataset = ["abcd", "bcda", "cdab", "dabc"]
# is_char_tokenization=False selects word-level q-grams
tok = Tokenizer('jaccard', 2, False)
tok.process(dataset)

Tokenization initialized with.. 
- Metric:  jaccard
- Q-grams:  2
- Char-Tokenization:  False
- Text cleaning process:  None

Processing starts.. 
- Data size:  4


Processing..: 100%|██████████| 4/4 [00:00<00:00, 2777.68it/s]


array([{('abcd',)}, {('bcda',)}, {('cdab',)}, {('dabc',)}], dtype=object)
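
Each set in the output above holds a single 1-gram because every example string is one word and word-level tokenization was selected (`is_char_tokenization=False`). For comparison, the following minimal sketch shows what character-level 2-grams look like for the same data. The helper `char_qgrams` is hypothetical and written in plain Python; it is intended to mirror what `nltk.ngrams` yields when applied to a string, without requiring NLTK.

```python
def char_qgrams(string, q):
    """Return the set of character q-grams of `string` as tuples."""
    return {tuple(string[i:i + q]) for i in range(len(string) - q + 1)}


dataset = ["abcd", "bcda", "cdab", "dabc"]
char_tokens = [char_qgrams(s, 2) for s in dataset]

# "abcd" and "bcda" share the 2-grams ('b', 'c') and ('c', 'd')
shared = char_tokens[0] & char_tokens[1]
```

With character q-grams the four strings overlap in their token sets, so a set-based distance such as Jaccard can tell them apart by degree rather than treating them as completely different.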

In [44]:
dataset

['abcd', 'bcda', 'cdab', 'dabc']