# Context

Most traditional machine learning algorithms work with structured data (numbers, categories, booleans). But how do we work with sequences? We can vectorize a sequence: cut it into pieces and, for each piece, increment the value at the corresponding index of a vector.

# Assignment Instructions

In this task, you need to write a class for vectorizing strings. The class will be tested on genomic sequences.

An important parameter is ngram_size: the length of the pieces, which we will call tokens, that each string is cut into. The pieces overlap, because the window shifts by one character at a time. For ngram_size = 4: 'TTTCTGCCA' → ['TTTC', 'TTCT', 'TCTG', 'CTGC', 'TGCC', 'GCCA']. To encode tokens as indexes, sort the unique tokens in lexicographic order and assign integer indexes starting from zero in that order. A token can appear more than once in one string, so for each string we count the total number of occurrences of each token. The list of strings is usually called a corpus, and the token-to-index dictionary is built for the entire corpus. Example of the transformation:



In [3]:
ngram_size = 2

corpus = [
    'AATACAT',  # 'AA', 'AT', 'TA', 'AC', 'CA', 'AT'
    'CTACCCT',  # 'CT', 'TA', 'AC', 'CC', 'CC', 'CT'
    'TACCTAC',  # 'TA', 'AC', 'CC', 'CT', 'TA', 'AC'
]

vocab = {'AA': 0, 'AC': 1, 'AT': 2, 'CA': 3, 'CC': 4, 'CT': 5, 'TA': 6}

transformed_corpus = [
    [1, 1, 2, 1, 0, 0, 1],
    [0, 1, 0, 0, 2, 2, 1],
    [0, 2, 0, 0, 1, 1, 2],
]
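The two steps described above, cutting strings into tokens and numbering the sorted unique tokens, can be sketched with two small helpers. The names `ngrams` and `build_vocab` are hypothetical and used only for illustration; this is one possible sketch, not a required interface.

```python
def ngrams(string, n):
    """Return all overlapping substrings of length n, left to right."""
    # A string shorter than n yields an empty list.
    return [string[i:i + n] for i in range(len(string) - n + 1)]


def build_vocab(corpus, n):
    """Map every distinct n-gram in the corpus to its rank in sorted order."""
    tokens = {token for string in corpus for token in ngrams(string, n)}
    return {token: index for index, token in enumerate(sorted(tokens))}


corpus = ['AATACAT', 'CTACCCT', 'TACCTAC']
print(ngrams('TTTCTGCCA', 4))
print(build_vocab(corpus, 2))
```

For the example corpus, `build_vocab(corpus, 2)` reproduces the `vocab` dictionary shown above.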

# Solution

Implement the class CountVectorizer with the following methods:
- fit  -  build a "token to index" dictionary from the input corpus and save it as an attribute of the vectorizer;
- transform  -  transform a new corpus using the saved dictionary and return a list of lists. If a token from the new corpus is not in the dictionary, ignore it;
- fit_transform  -  fit and transform on the same corpus and return a list of lists.
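The rule for unknown tokens can be illustrated on a single string. The sketch below assumes a vocabulary that has already been built and uses a hypothetical helper `count_row`; it relies on `collections.Counter` returning 0 for missing keys, so tokens outside the vocabulary never get a column.

```python
from collections import Counter


def count_row(string, vocab, n):
    """Count the n-grams of one string in the column order given by vocab."""
    counts = Counter(string[i:i + n] for i in range(len(string) - n + 1))
    # Only vocabulary tokens get a column, so unseen tokens drop out here.
    return [counts[token] for token in sorted(vocab, key=vocab.get)]


vocab = {'AA': 0, 'AC': 1, 'AT': 2, 'CA': 3, 'CC': 4, 'CT': 5, 'TA': 6}
print(count_row('TCAATCAC', vocab, 2))  # [1, 1, 1, 2, 0, 0, 0]
print(count_row('GGGG', vocab, 2))      # [0, 0, 0, 0, 0, 0, 0]
```

In the first call the token 'TC' occurs twice in the string but is absent from the vocabulary, so it contributes nothing to the row.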

In [1]:
class CountVectorizer:
    def __init__(self, ngram_size):
        self.ngram_size = ngram_size
        self.vocab = {}  # token -> column index, filled in by fit()

    def _tokenize(self, string):
        # Slide a window of ngram_size characters over the string.
        n = self.ngram_size
        return [string[i:i + n] for i in range(len(string) - n + 1)]

    def fit(self, corpus):
        # Collect the unique tokens of the whole corpus and number them
        # in lexicographic order, starting from zero.
        uniq_tokens = {token for string in corpus
                       for token in self._tokenize(string)}
        # Store the vocabulary on the instance, not on the class, so that
        # separate vectorizers do not overwrite each other's dictionaries.
        self.vocab = {token: index
                      for index, token in enumerate(sorted(uniq_tokens))}

    def transform(self, corpus):
        # Count only tokens present in the vocabulary; unknown tokens
        # are ignored.
        result = []
        for string in corpus:
            row = [0] * len(self.vocab)
            for token in self._tokenize(string):
                if token in self.vocab:
                    row[self.vocab[token]] += 1
            result.append(row)
        return result

    def fit_transform(self, corpus):
        self.fit(corpus)
        return self.transform(corpus)

Check the solution:

In [2]:
corpus = [
    'AATACAT',  # 'AA', 'AT', 'TA', 'AC', 'CA', 'AT'
    'CTACCCT',  # 'CT', 'TA', 'AC', 'CC', 'CC', 'CT'
    'TACCTAC',  # 'TA', 'AC', 'CC', 'CT', 'TA', 'AC'
]

correct_transformation = [
    [1, 1, 2, 1, 0, 0, 1],
    [0, 1, 0, 0, 2, 2, 1],
    [0, 2, 0, 0, 1, 1, 2],
]

# case 1
vectorizer = CountVectorizer(ngram_size=2)
vectorizer.fit(corpus)
vectorizer.transform(corpus) == correct_transformation  # True

# case 2
vectorizer = CountVectorizer(ngram_size=2)
vectorizer.fit_transform(corpus) == correct_transformation  # True

# case 3
corpus_2 = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']
vectorizer = CountVectorizer(ngram_size=2)
vectorizer.fit(corpus)
vectorizer.transform(corpus_2) == [
    [1, 1, 1, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0, 0]
]  # True

True