# Introduction

Motto: "garbage in, garbage out". Feeding dirty data into a model produces meaningless results. The steps for improving data quality are:

1. Getting the data - this is rather easy since the texts are pre-uploaded.
2. Cleaning the data - use popular text pre-processing techniques.
3. Organizing the data - organize the cleaned data in a way that is easy to input into machine learning algorithms.

The output of this notebook will be clean, organized data in two standard text formats, together with the fitted vectorizer:

1. Corpus - a matrix storing the text samples together with their authors.
2. Term Frequency - Inverse Document Frequency Table - another matrix consisting of word weights in relation to how often they appear in the texts.
3. TfidfVectorizer - an instance of the TfidfVectorizer class, since it may be needed later.

## Getting The Data

Input: Names of the files containing the authors' texts.

Output: Corpus - a matrix in which each row holds a sample text in the first column and the author who wrote it in the second.

In [None]:
import pandas as pd

pd.set_option('max_colwidth', 150)
corpus = pd.DataFrame(columns=['text', 'author'])
corpora_size = 0

In [None]:
authors = {
    'Ivan Vazov': ['/content/drive/MyDrive/Colab Notebooks/project/data/vazov_separated/Ivan_Vazov_-_Pod_igoto_-_1773-b.txt',
                   '/content/drive/MyDrive/Colab Notebooks/project/data/vazov_separated/Ivan_Vazov_-_Epopeja_na_zabravenite_-_3-b.txt'],
    'Jordan Jovkov': ['/content/drive/MyDrive/Colab Notebooks/project/data/jovkov_separated/Jordan_Jovkov_-_Chiflikyt_kraj_granitsata_-_2033-b.txt',
                      '/content/drive/MyDrive/Colab Notebooks/project/data/jovkov_separated/Jordan_Jovkov_-_Prikljuchenijata_na_Gorolomov_-_2034-b.txt',
                      '/content/drive/MyDrive/Colab Notebooks/project/data/jovkov_separated/Jordan_Jovkov_-_Staroplaninski_legendi_-_522-b.txt',
                      '/content/drive/MyDrive/Colab Notebooks/project/data/jovkov_separated/Jordan Jovkov -  - . Posledna radost - 7896.txt',
                      '/content/drive/MyDrive/Colab Notebooks/project/data/jovkov_separated/Jordan_Jovkov_-_Vecheri_v_Antimovskija_han_-_517-b.txt']
}

In [None]:
for author, texts in authors.items():
    # Concatenate all of the author's files into a single string.
    authors[author] = ''

    for text in texts:
        with open(text, 'r', encoding='utf-8') as file:
            authors[author] += file.read()

    total_chars = len(authors[author])
    to_get = round(total_chars / 100)
    print(f'Total number of characters for {author}: {total_chars:,}. Going to create 100 samples with length {to_get:,}.\n')

    # Cut the text into 100 consecutive samples and add each one to the corpus.
    for i in range(100):
        paragraph = authors[author][i * to_get:(i + 1) * to_get]

        corpus.loc[corpora_size] = [paragraph, author]
        corpora_size += 1

Total number of characters for Ivan Vazov: 788,477. Going to create 100 samples with length 7,885.

Total number of characters for Jordan Jovkov: 1,191,124. Going to create 100 samples with length 11,911.
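
The sampling step can be illustrated on a toy string. The sketch below is a minimal illustration, assuming a hypothetical helper named `split_into_samples` that is not part of the notebook. It shows a side effect of rounding the sample length: when the length is rounded up, the last sample comes out slightly shorter (for Ivan Vazov, 100 × 7,885 exceeds 788,477 by 23 characters), and when it is rounded down, a few trailing characters are left out (for Jordan Jovkov, 100 × 11,911 covers 1,191,100 of the 1,191,124 characters).

```python
def split_into_samples(text, n_samples=100):
    """Cut `text` into `n_samples` consecutive slices of round(len(text) / n_samples) characters."""
    size = round(len(text) / n_samples)
    return [text[i * size:(i + 1) * size] for i in range(n_samples)]


# Toy input (hypothetical): 1,049 characters split into 10 samples of round(104.9) = 105.
toy_text = 'а' * 1049
samples = split_into_samples(toy_text, n_samples=10)

print(len(samples))                  # number of samples: 10
print(len(samples[0]))               # length of a full-size sample: 105
print(len(samples[-1]))              # the last sample is shorter: 104
print(sum(len(s) for s in samples))  # characters covered in total: 1049
```

Both effects are tiny relative to the sample length, so they are unlikely to matter for the features built later.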



## Cleaning The Data

Pre-process all texts with common data cleaning steps to remove noise.

1. Make text all lower case.
2. Remove punctuation.
3. Remove Latin characters (helps with removing Roman numerals in chapter headers).
4. Tokenize text by using whitespace as a word boundary.
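
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the function name `basic_clean` and the sample sentence are made up, and punctuation is detected through `unicodedata` categories instead of the `regex` module.

```python
import re
import unicodedata


def basic_clean(raw_text):
    """Lowercase, strip punctuation and Latin letters, then split on whitespace."""
    text = raw_text.lower()
    # Drop every character whose Unicode category starts with 'P' (punctuation).
    text = ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))
    # Drop Latin letters, which removes Roman numerals such as "XII".
    text = re.sub(r'[a-zA-Z]', '', text)
    return text.split()


# Invented sample sentence, for illustration only.
tokens = basic_clean('XII. Бойчо Огнянов се върна, но най-малките спяха.')
print(tokens)
```

Note that deleting punctuation outright merges hyphenated words ("най-малките" becomes "наймалките"), which is why tokens such as "наймал" show up in the cleaned data.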

More data cleaning steps after tokenization:

1. Remove stop words.
2. Lemmatization.
3. Stemming.
4. Create bi-grams.

Input: Corpus.

Output: A vector of token lists, one per text sample.
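
Stop-word removal and bi-gram creation can be sketched without any language-specific resources. The example below is a rough illustration under stated assumptions: `add_bigrams` and `toy_stop_words` are hypothetical names, the stop-word set is invented, and lemmatization and stemming are skipped because they need the Bulgarian models.

```python
def add_bigrams(tokens):
    """Return the tokens followed by every pair of adjacent tokens joined by a space."""
    pairs = [f'{first} {second}' for first, second in zip(tokens, tokens[1:])]
    return tokens + pairs


# Hypothetical stop-word set, for illustration only; the real list comes from a library.
toy_stop_words = {'се', 'но', 'и'}

tokens = ['бойчо', 'огнянов', 'се', 'върна', 'но', 'спяха']
filtered = [t for t in tokens if t not in toy_stop_words and t.isalpha()]
features = add_bigrams(filtered)
print(features)
```

Because bi-grams are built after the stop words are removed, a pair such as "огнянов върна" can join two words that were not adjacent in the original sentence.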

In [None]:
! pip install lemmagen3
! pip install bulstem
! pip install stop-words

import regex as re
from nltk import bigrams

from bulstem.stem import BulStemmer 
from lemmagen3 import Lemmatizer 
from stop_words import get_stop_words

# Load the language resources once instead of on every call to tokenize().
STOP_WORDS = set(get_stop_words('bulgarian'))
lemmatizer = Lemmatizer('bg')
stemmer = BulStemmer.from_file('/content/drive/MyDrive/Colab Notebooks/project/data/stem_rules_context_2_utf8.txt',
                               min_freq=2, left_context=2)

def tokenize(raw_text):
    text = raw_text.lower()  # Make lowercase.
    text = re.sub(r'\p{P}+', '', text)  # Remove punctuation.
    text = re.sub(r'[a-zA-Z]', '', text)  # Remove Latin letters (e.g. Roman numerals).

    tokens = text.split()  # Split on whitespace.
    tokens = [token for token in tokens if token not in STOP_WORDS  # Filter out stop words
              and token.isalpha()]  # and non-word tokens.

    # Before lemmatization (sample): ['песни', 'македония', 'българският', 'бог', ..
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatization!
    # Before stemming (sample): ['песен', 'македония', 'български', 'бог', ..
    tokens = [stemmer.stem(token) for token in tokens]  # Stemming!

    # Add bi-grams: every pair of adjacent tokens joined by a space.
    tokens += [' '.join(pair) for pair in bigrams(tokens)]

    # After pre-processing (sample): ['песен', 'македони', 'българск', 'бог', ..
    return tokens

Collecting lemmagen3
  Downloading lemmagen3-3.3.1-cp36-cp36m-manylinux2010_x86_64.whl (12.4MB)
Collecting pybind11>=2.4
  Downloading pybind11-2.6.2-py2.py3-none-any.whl (191kB)
Installing collected packages: pybind11, lemmagen3
Successfully installed lemmagen3-3.3.1 pybind11-2.6.2
Collecting bulstem
  Downloading bulstem-0.3.3-py3-none-any.whl (831kB)
Installing collected packages: bulstem
Successfully installed bulstem-0.3.3
Collecting stop-words
  Downloading

In [None]:
data_clean = corpus.text.map(tokenize)
data_clean

0      [иго, прокуд, българ, прекара, одо, скръб, мъки, изпитва, изгуб, отечеств, ум, сърц, душа, постоян, летя, дойд, вдъхновени, напи, тоя, рома, задиш...
1      [ки, съдеб, практи, щя, чуя, обвинени, защит, изда, присъд, погал, главич, потегл, ухо, наймал, сиреч, обид, целун, бузк, народ, умир, наймал, чов...
2      [повър, мрак, завчас, ща, туря, непробива, прегра, стража, фукн, право, мина, вихър, сам, сеймен, отбягн, стража, спогн, улиц, заехт, стъпк, вик, ...
3      [к, река, добродуш, чорбадж, има, леп, девойк, маша, разгел, почерп, гост, хайд, ида, раки, пазя, водениц, прибав, заплашител, познава, емексиз, п...
4      [бележ, воденичар, кралич, марийк, борб, избяга, бряст, хленч, уплаш, отива, манастир, чийт, висок, стена, огр, месечин, беле, тъмен, клон, орех, ...
                                                                               ...                                                                          
195    [вълчиц, отива, връща, отта, вярван, крия, тъдяв, н

## Organizing The Data

The output of this notebook is generated and saved as pickles here. A quick recap:

- Corpus: a collection of texts.
- Term Frequency - Inverse Document Frequency Table: word weights in a matrix format.
- TfidfVectorizer: the fitted vectorizer instance, saved alongside the table.

### Corpus

In [None]:
# A final look before saving.
corpus

Unnamed: 0,text,author
0,"﻿\tПод игото\n\n\n\n\tПрокуден от България в 1887 година, прекарах около една година в Одеса. Много скръб, много мъки изпитвах там по изгубеното о...",Ivan Vazov
1,"ки съдебната практика, не щя да чуе ни обвинение, ни защита, а издаде присъда: някои погали по главичките, други потегли за ушите, а най-малките —...",Ivan Vazov
2,"се повърне, и мракът завчас щеше да тури непробиваема преграда между него и стражата. Но той фукна право към нея, мина като вихър между самите се...",Ivan Vazov
3,"к и рече добродушно:\n\t— Чорбаджи, ти си имал лепа девойка, машала. Разгеле да почерпи гостите. Хайде, иди за ракия, а ние ще пазим воденицата. —...",Ivan Vazov
4,"бележиха нищо.\n\tСлед малко воденичарят, Краличът и Марийка, която във време на борбата беше избягала под един бряст и хленчеше уплашено, отиваха...",Ivan Vazov
...,...,...
195,"че вълчицата отива и се връща оттам. По за вярване е, че тя се крие тъдява някъде из нивите. Те са изкласили, високи колкото човешки бой, и в тях...",Jordan Jovkov
196,"се увери, че наистина всичко сочи на суша. Сенките бяха стигнали вече досред пътя, а още беше горещо, пепелта пареше. Чубрата се усмихваше, защото...",Jordan Jovkov
197,"инаваше оттука, не беше за друго, а защото по тия отстранени места, по които не ходеха нито хора, нито добитък, имаше хубава трева. На отиване и н...",Jordan Jovkov
198,"а на друго място. Никой вече не може да го намери де е.\n\tОдърът заскърца, старата захвърли с мъка тежките черти и се поизправи. Очите й не мигва...",Jordan Jovkov


In [None]:
# Pickle!
corpus.to_pickle('/content/drive/MyDrive/Colab Notebooks/project/data/corpus.pkl')

### Term Frequency - Inverse Document Frequency Table

Constructed using scikit-learn's TfidfVectorizer, where every row represents a different document / sample / excerpt from a text and every column represents a different term (a word or a bi-gram).

Because the text that will be passed to the vectorizer is already pre-processed and tokenized, some additional attributes have to be passed that substitute the built-in functionality with the identity function.

In addition, with TfidfVectorizer, terms that appear too infrequently can be removed. In this case, those that appear in fewer than 2 documents are ignored.
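
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on already tokenized documents. It is not scikit-learn's code; it assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is how scikit-learn documents its default settings. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs, min_df=2):
    """Weight pre-tokenized documents with smoothed IDF and L2-normalized rows."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only the terms that occur in at least `min_df` documents.
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy documents (hypothetical): two stems plus their bi-gram each.
toy_docs = [
    ['песен', 'бог', 'песен бог'],
    ['бог', 'кон', 'бог кон'],
    ['песен', 'кон', 'песен кон'],
]
vocab, rows = tfidf_rows(toy_docs, min_df=2)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Each bi-gram occurs in only one toy document, so `min_df=2` drops all of them and only the three single stems remain as columns.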

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

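def identity(x):
    return x

# The texts are already tokenized, so bypass the built-in pre-processing and tokenization.
tfidf = TfidfVectorizer(
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,
    lowercase=False,
    stop_words=None,
    min_df=2)  # Ignore terms that appear in fewer than 2 documents.

data_tfidf = tfidf.fit_transform(data_clean)
# Note: scikit-learn 1.0 introduced get_feature_names_out(), and get_feature_names() was removed in 1.2.
data_table = pd.DataFrame(data_tfidf.toarray(), columns=tfidf.get_feature_names())
data_table.index = data_clean.index
data_table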

Unnamed: 0,аа,ааз,аба,абаджи,абич,абя,август,август тая,авджи,авджимихалев,авра,аврамиц,аврамов,австри,австрийск,австрийск химн,автома,авторитет,ага,ага остав,агитатор,агитаци,агн,аго,агони,ад,ада,адвока,адвока град,адвокатск,адов,адрес,адск,аз,аз аз,аз ах,аз баща,аз бог,аз боже,аз видя,...,яма,ямурлу,ямурлу въз,ямурлу изл,ямурлу лице,яна,яна калмучк,янак,янк,янк разносвач,яня,яр,яра,яре,яребиц,ярк,ярк светли,яркочерв,ярослав,ярослав бързобегунек,ярост,ясен,ясен око,ясл,ясн,ясн вижда,ясн висок,ясн лича,ясн нон,ясн отпечата,ясн познава,ясн чува,яснот,ястреб,ястребов,ята,ятага,яхн,яхн бял,яхн кон
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.04303,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.012194,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033947,0.0,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.009808,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.014914,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038154,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.026926,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.035007,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.026063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.039291,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.011135,0.0,0.0,0.0,0.000000,0.039291,0.0,...,0.034058,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.023593,0.00000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029482,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.022027,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.050241,0.0,0.036715,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.015347,0.000000,0.0,0.0,0.0,0.013755,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.011830,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.01202,0.0,0.0,0.007899,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.035437,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.020076,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.020468,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.023613,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017718,0.0,0.0,0.0
197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.024750,0.0,0.0,0.0,0.029111,0.000000,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.012545,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
198,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.016668,0.0,0.000000,0.016668,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.006295,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0


In [None]:
# Pickles!
import pickle

data_table.to_pickle('/content/drive/MyDrive/Colab Notebooks/project/data/data_table.pkl')

# Use a context manager so the file is flushed and closed.
with open('/content/drive/MyDrive/Colab Notebooks/project/data/vectorizer.pkl', 'wb') as file:
    pickle.dump(tfidf, file)
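
For reading the pickles back in a later notebook, a round trip might look like the sketch below. The helper names `save_pickle` and `load_pickle`, the stand-in object and the temporary path are all hypothetical. One caveat applies to the vectorizer: pickle stores functions by reference, so a function named `identity` most likely has to be defined under the same name and module before `vectorizer.pkl` is loaded.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize `obj` to `path`, closing the file even if pickling fails."""
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Round trip with a stand-in object and a temporary path (both hypothetical).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    save_pickle({'min_df': 2, 'vocabulary': ['бог', 'кон']}, path)
    restored = load_pickle(path)

print(restored)
```

The DataFrames saved with `to_pickle` can be read back with `pd.read_pickle` in the same way.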