This repository provides core functions for IRTM processing and is under continuous development.
pip install git+https://github.com/KanishkNavale/IRTM-Toolbox.git
from irtm.toolbox import *
-
Soundex: A phonetic algorithm for indexing names by sound, as pronounced in English.
print(soundex('Muller'))
>>> 'M466'
print(soundex('Mueller'))
>>> 'M466'
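To illustrate the idea of indexing names by sound, here is a minimal sketch of the classic American Soundex rules. The function name `soundex_sketch` is hypothetical and is not part of the toolbox; the toolbox's own `soundex` may use a different variant, so its codes can differ from the ones this sketch produces.

```python
def soundex_sketch(name: str) -> str:
    """Classic American Soundex (illustrative, not the toolbox implementation)."""
    # Map each consonant to its Soundex digit; vowels, h, w and y get no digit.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()          # the first letter is kept as-is
    prev = codes.get(letters[0], "")     # digit of the previously coded letter
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:        # skip repeats of the same digit
            result += code
        if ch not in "hw":               # h and w do not separate equal digits
            prev = code
    return (result + "000")[:4]          # pad or truncate to four characters


print(soundex_sketch("Robert"), soundex_sketch("Rupert"))
```

Names that sound alike, such as "Robert" and "Rupert", map to the same four-character code, which is what makes the code usable as an index key.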
-
Tokenizer: Converts a sequence of characters into a sequence of tokens.
print(tokenize('LINUX'))
>>> ['linux']
print(tokenize('Text Mining 2021'))
>>> ['text', 'mining']
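The behaviour shown above (lowercasing and dropping the number) can be approximated with a single regular expression. This is only a sketch: `tokenize_sketch` is a hypothetical name, and the toolbox's `tokenize` may apply further steps that are not visible in these two examples.

```python
import re


def tokenize_sketch(text: str) -> list[str]:
    """Lowercase the text and keep only runs of ASCII letters (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


print(tokenize_sketch("LINUX"))
print(tokenize_sketch("Text Mining 2021"))
```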
-
Vectorize: Converts a list of strings into a token-based weight tensor.
vector = vectorize([
    'texts ([string]): a multiline or a single line string.',
    'dict ([list], optional): list of tokens. Defaults to None.',
    'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
    'normalize (str, optional): normalization of vector. Defaults to l2.',
    'max_dim ([int], optional): dimension of vector. Defaults to None.',
    'smooth (bool, optional): restricts value >0. Defaults to True.',
    'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
    'return_features (bool, optional): feature vector. Defaults to False.'
])
print(f'Vector Shape={vector.shape}')
>>> Vector Shape=(8, 37)
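The options listed above (log-weighted TF, IDF, l2 normalization) correspond to a standard TF-IDF pipeline. The sketch below shows one way such a matrix can be built with NumPy. The name `tfidf_sketch`, the regex tokenizer and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for illustration; the toolbox's `vectorize` may compute its weights differently.

```python
import re

import numpy as np


def tfidf_sketch(texts, vocabulary=None):
    """Return an l2-normalized TF-IDF matrix (documents x tokens) and its vocabulary."""
    docs = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    if vocabulary is None:
        vocabulary = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocabulary)}

    # Raw term counts per document.
    tf = np.zeros((len(docs), len(vocabulary)))
    for i, doc in enumerate(docs):
        for tok in doc:
            if tok in index:
                tf[i, index[tok]] += 1

    # Weighted TF: 1 + log(tf) for terms that occur, 0 otherwise.
    present = tf > 0
    weighted = np.zeros_like(tf)
    weighted[present] = 1.0 + np.log(tf[present])

    # Smoothed IDF (assumed formula), always positive.
    df = present.sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0

    # Row-wise l2 normalization, leaving empty rows as zeros.
    matrix = weighted * idf
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms, vocabulary


matrix, vocab = tfidf_sketch(["text mining", "text retrieval"])
print(vocab, matrix.shape)
```

In this small example "text" appears in both documents, so it receives a lower weight than "mining" or "retrieval" within each row.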
-
Predict Token Weights: Computes the importance of each token through classification optimization.
import numpy as np

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
    'X ([np.array]): vectorized matrix columns arraged as per the dictionary.',
    'y ([labels]): True classification labels.',
    'epochs ([int]): Optimization epochs.',
    'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
    'dict ([type], optional): list of tokens. Defaults to None.'
], dict=dictionary)
# Note: np.random.randint(1, ...) samples from [0, 1), so every label is 0;
# use np.random.randint(2, ...) for random binary labels.
labels = np.random.randint(1, size=(vector.shape[0], 1))
weights = predict_weights(vector, labels, 100, dict=dictionary)
>>> Token-Weights Mappings: {'vector': 0.22097790924850977, 'string': 0.39296369957440075, 'bool': 0.689853175081446}
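The README does not state which classifier `predict_weights` optimizes. As one possible reading, the sketch below fits a logistic regression by gradient descent and reports the learned coefficient of each token. Everything here (the name `predict_weights_sketch`, the logistic model, the learning rate `lr`) is an assumption for illustration, not the toolbox's method.

```python
import numpy as np


def predict_weights_sketch(X, y, epochs, lr=0.1, vocabulary=None):
    """Fit logistic regression by gradient descent; return one weight per column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        err = p - y                              # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    if vocabulary is None:
        return w
    return dict(zip(vocabulary, w.tolist()))


# Toy data: the first token only occurs in class 1, the second only in class 0.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y = np.array([1, 1, 0, 0])
print(predict_weights_sketch(X, y, epochs=200, vocabulary=["good", "bad"]))
```

With this toy data the token tied to the positive class ends up with a positive weight and the other with a negative one, which is the sense in which a coefficient can be read as token importance.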
-
Page Rank: Computes page rank from a chain matrix.
chain_matrix = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 0]])
print(page_rank(chain_matrix))
>>> [0.0047 0.997 0.0767]
rank, TPM = page_rank(chain_matrix, return_TransMatrix=True)
print(f'Page Rank: {rank} \nTransition Probability Matrix: \n{TPM}')
>>> Page Rank: [0.0047 0.997 0.0767]
Transition Probability Matrix:
[[0.03333333 0.03333333 0.93333333]
 [0.48333333 0.03333333 0.48333333]
 [0.03333333 0.93333333 0.03333333]]
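The transition probability matrix shown above is consistent with row-normalizing the chain matrix and mixing in a teleport probability of 0.1. The sketch below rebuilds that matrix and then estimates the stationary distribution by power iteration. The name `page_rank_sketch`, the teleport value and the stopping rule are assumptions; the rank vector here is normalized to sum to 1, which may not match the scaling of the toolbox's `page_rank` output.

```python
import numpy as np


def page_rank_sketch(chain_matrix, teleport=0.1, tol=1e-10, max_iter=1000):
    """Return (rank, transition_matrix) using power iteration with teleportation."""
    A = np.asarray(chain_matrix, dtype=float)
    n = A.shape[0]

    # Row-normalize the links; rows without any link jump uniformly.
    row_sums = A.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, A / safe_sums, 1.0 / n)

    # Mix in the teleport step so every entry is strictly positive.
    P = (1.0 - teleport) * P + teleport / n

    # Power iteration: repeatedly apply P to a uniform start vector.
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = rank @ P
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank, P


chain = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 0]])
rank, tpm = page_rank_sketch(chain)
print(tpm)
print(rank)
```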