# Creating tables of **letter pairs**

## About
The task is to decode "encrypted" text using the number of occurrences of letter pairs in natural-language text.

This notebook addresses one part of that task directly: measuring how often each pair of letters occurs in a text.

The procedure is as follows:
- Read the contents of the preprocessed text files into a dictionary.
- Build quantitative (count) tables for each file and for all texts combined.
- Build quantitative and frequency tables for all texts, with and without the space character.
- Build `biased` quantitative and frequency tables for all texts, with and without the space character.

Statistics on these tables are in a separate notebook, because building the quantitative tables from the text files takes a long time. This notebook is therefore meant to be run once, only to create the tables. Later steps read the saved tables from disk (Google Drive).
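
As a small illustration of what "occurrences of letter pairs" means, the sketch below counts adjacent character pairs in a short string with `collections.Counter`. The sample string and the helper name `count_pairs` are assumptions made for this example; the notebook itself builds full 27×27 tables further down.

```python
from collections import Counter

def count_pairs(text):
    """Count every pair of adjacent characters (hypothetical helper)."""
    return Counter(zip(text, text[1:]))

sample = "the then"            # illustrative input, not taken from the corpus
pairs = count_pairs(sample)
print(pairs[("t", "h")])       # how often 't' is followed by 'h'
print(pairs[("e", " ")])       # how often 'e' is followed by a space
```

A text of `n` characters always yields `n - 1` pairs, which is a quick sanity check on any pair table.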

The following texts were used:
- (1 HS) "Hamlet" by Shakespeare ([here](https://shakespearestudyguide.com/Hamlet%20Text.html)) [Without expressions like "BERNARDO:"]
- (2 TB) Torah, Sefer Bereshit in English ([here](https://www.tanach.us/Pages/About.html))
- (3 AR) Asimov "I, Robot" ([here](https://royallib.com/book/Asimov_Isaac/I_Robot.html))
- (4 HP) "Harry Potter and the Philosopher's Stone" ([here](https://github.com/amephraim/nlp/blob/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt))
- (5 GG) Genetics : From Genes to Genomes. 6th edition ([here](http://skgjx.whu.edu.cn/Public/upfile/article/202103031656469260.pdf)) [Intro & parts 1,2,4. My edit from PDF]
- (6 SA) Scientific articles: from IEEE, [medium.com](https://medium.com/) etc.
- (7 NA) News articles: [New York Times](https://www.nytimes.com/international/), [Fox News](https://www.foxnews.com/), [BBC](https://www.bbc.com/news) etc.
- (8 CC) Comments in chats








---
---
## Import & mount

In [1]:
import os
import re
import copy
from time import time
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

# from matplotlib import pyplot as plt
# plt.rcParams['figure.figsize'] = [15, 6]

In [2]:
# Mount GitHub
!git clone https://github.com/EdwardGerman/Columnar-Transposition-Cipher.git  # clone repository
%ls  # checking whether all files are present
drch = '/content/Columnar-Transposition-Cipher' # Path to data

folder_pp = 'Data_pp'
folder_pt = 'Parity_tables'

Mounted at /content/drive


In [3]:
title_font = {'family': 'serif',
              'color':  'darkred',
              'weight': 'bold',
              'size': 14,
              }

---
---
## Functions

#### Space_n_Letters()
Counts all characters, letters, and spaces in a preprocessed text.

In [4]:
def Space_n_Letters(pp_text, p = False):
    # Assumes the preprocessed text contains only letters and spaces
    chars_after_PP = len(pp_text)
    spaces = pp_text.count(' ')
    letters = chars_after_PP - spaces

    if not p:
        # Return the counts as [all characters, letters, spaces]
        return [chars_after_PP, letters, spaces]
    else:
        # Print the counts instead of returning them
        print('Symbols number:', chars_after_PP)
        print('Spaces number:', spaces)
        print('Only letters: ', letters,'\n')
    # print('Only letters (2): ', len(list(filter(str.isalpha, pp_text))),'\n')

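#### CharactersCount()
Counts the occurrences of each character in the text and returns a dictionary sorted by decreasing count.

Notes (optional behavior):
* By default, uppercase and lowercase letters are counted together (`uppercase = False`).
* By default, runs of multiple spaces are collapsed into one space (`multispace = False`).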

In [5]:
# Returns sorted dictionary {letter : count} from largest to smallest number
def CharactersCount(orig_text, multispace = False, uppercase = False):
    text = copy.copy(orig_text)
    if multispace == False: text = re.sub(r' +', ' ', text)
    if uppercase  == False: text = text.lower()

    counts = {}   # Creating empty dictionary
    for char in text:    # Loop for count characters
        if char in counts: counts[char] += 1
        else:              counts[char]  = 1
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
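
For comparison, the same counting can be sketched with `collections.Counter` from the standard library. The function name `characters_count_counter` and the sample string are hypothetical; the parameters mirror those of `CharactersCount()` above.

```python
import re
from collections import Counter

def characters_count_counter(orig_text, multispace=False, uppercase=False):
    """Hypothetical Counter-based variant of CharactersCount()."""
    text = orig_text
    if not multispace:
        text = re.sub(r' +', ' ', text)   # collapse runs of spaces into one
    if not uppercase:
        text = text.lower()               # count 'A' and 'a' together
    # most_common() orders the items by decreasing count
    return dict(Counter(text).most_common())

result = characters_count_counter("Aa  bB a")
print(result)
```

With the defaults, the sample is reduced to `"aa bb a"` before counting, so `'a'` comes first in the returned dictionary.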

#### Display of the DataFrames
Helper for displaying DataFrames without truncating their columns

In [6]:
def display_df(df, name = True):
    pd.set_option('display.max_columns', None) # Show all columns of the DF
    if name and hasattr(df, 'name'): print(df.name + ':')
    display(df)
    print('\n')
    pd.reset_option('display.max_columns')     # Restore the default column limit


---
---
---
## Creating ***quantitative tables***
The order is as follows:
1. Make a `dictionary` of texts, including a *united text*
    * Build the file list and check the file names
    * Read the texts into the dict
    * Add the united text to the dict
2. Compile the `quantitative tables` for pairs
    * Define the function that compiles a *quantitative table*
    * Compile quantitative tables for each text, including the united text


---
### Make a ***dictionary*** of texts, including a *united text*

Build the file list and check the file names

In [7]:
# Get list of all files in directory
file_list = os.listdir(os.path.join(drch, folder_pp))

# Checking:
for name in file_list:
    if not name.endswith('_pp.txt'):
        print(name, ' - unexpected name: does not end with "_pp.txt"')

# Get list of key names for dictionary
file_names = [name.replace('_pp.txt', '') for name in file_list if name.endswith('_pp.txt')]
display(file_names)

['Torah_Bereshit',
 'Genetics_124',
 'Chat_Comments',
 'Asimov_I_Robot',
 'Harry_Potter_I',
 'News_articles',
 'Hamlet_BB',
 'Sci_articles',
 'All texts']

***Read*** the texts into the dict. The ***united text*** appears in the file list above as `All texts`, so it is read like any other file:

In [8]:
# Read text to dict:
texts = {}  # Create empty dict

# Loop for each file:
for text_file in file_list:

    # Read:
    path_r = os.path.join(drch, folder_pp, text_file)
    with open(path_r, 'r', errors='ignore') as file:
        text = file.read()

    texts[text_file.replace('_pp.txt', '')] = text


### Creating ***quantitative*** tables

Function that compiles a `quantitative` table for a single text

In [9]:
def quantitative_table(name, text):
    # Creating names list for index & columns of DF: 'space' + all letters from 'a' to 'z'
    all_letters = [' '] + [chr(i) for i in range(ord('a'), ord('z') + 1)]

    # Create the DF with named index and columns, with every cell set to zero
    df = pd.DataFrame(0, index=all_letters, columns=all_letters)


    # Iterate by text
    for i in range(len(text) - 1):
        first = text[i]
        second = text[i + 1]
        # Increase by 1 value of cell with corresponding row (index) and column
        df.at[first, second] += 1

    df.index = [s + '-' for s in all_letters]
    df.columns = ['-' + s for s in all_letters]
    df.name = name
    return df
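
Because filling the table one cell at a time is slow on long texts, a vectorized variant may be worth considering. The sketch below is an assumption-laden alternative, not the notebook's method: the name `quantitative_table_fast` is hypothetical, and the code assumes the text contains only lowercase `a`-`z` and spaces. It uses `numpy.add.at` to accumulate all pair counts in one call.

```python
import numpy as np
import pandas as pd

def quantitative_table_fast(text):
    """Hypothetical vectorized variant; assumes only 'a'-'z' and spaces."""
    all_letters = [' '] + [chr(i) for i in range(ord('a'), ord('z') + 1)]
    codes = np.frombuffer(text.encode('ascii'), dtype=np.uint8).astype(np.int64)
    # Map ' ' -> 0 and 'a'..'z' -> 1..26
    idx = np.where(codes == ord(' '), 0, codes - ord('a') + 1)
    counts = np.zeros((27, 27), dtype=np.int64)
    # Unbuffered add: repeated (first, second) index pairs are all counted
    np.add.at(counts, (idx[:-1], idx[1:]), 1)
    return pd.DataFrame(counts,
                        index=[s + '-' for s in all_letters],
                        columns=['-' + s for s in all_letters])

table = quantitative_table_fast("the then")
print(table.loc['t-', '-h'])    # count of 't' followed by 'h'
```

The row and column labels follow the same `x-` / `-x` convention as `quantitative_table()`, so the result can be compared cell by cell with the loop-based version.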

Compiling `quantitative` tables for each text

In [10]:
quantit_tables = {}

for name in texts:
    quantit_tables[name] = quantitative_table(name, texts[name])

    # Save DataFrame with quantitative Parity Table to disk (Google Drive)
    path_pt = os.path.join(drch, folder_pt, name + '.csv')
    quantit_tables[name].to_csv(path_pt, index_label=False)


Save DataFrames with *`quantitative`* `Parity Tables` to disk (Google Drive)

The loop above already saves each table. The separate loop below does the same thing and is kept commented out for reference.

In [11]:
# for name in quantit_tables:
#     path_pt = os.path.join(drch, folder_pt, name + '.csv')
#     quantit_tables[name].to_csv(path_pt, index_label=False)

## Statistics

Quantitative parity table for `All texts`:

In [12]:
print('All chars:' ,quantit_tables['All texts'].values.sum())
display_df(quantit_tables['All texts'])

All chars: 2263891
All texts:


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,43411,17690,19257,14681,9638,16856,10019,24691,28471,2121,3059,10513,16610,9263,24615,15370,1162,10588,30310,61470,4807,3340,24373,642,7990,364
a-,11562,276,3297,6124,4707,280,952,2996,718,6242,98,2080,13634,5119,27631,135,2735,23,15403,12045,20611,1971,3351,1013,213,3564,191
b-,1036,2469,358,40,13,7706,3,19,9,1760,139,1,3439,14,47,3821,30,0,2280,442,170,3042,70,0,2,1623,1
c-,2372,7287,21,715,41,7201,53,183,8266,2405,0,2674,1932,28,21,9134,14,20,2215,483,4608,1783,57,0,0,520,11
d-,42369,2368,98,120,759,9265,129,295,23,6460,23,11,667,151,1009,4604,23,85,1668,1848,61,2057,146,106,4,842,5
e-,79473,11005,366,5636,17124,6191,1867,1576,602,2720,47,310,8886,4734,18902,1046,2516,635,28192,17631,7256,338,3643,1269,2582,3594,74
f-,14427,2479,2,21,9,3658,2054,24,4,4152,0,1,1234,1,1,7167,24,0,2661,111,1078,1414,0,16,0,169,1
g-,13712,2424,34,64,36,6488,12,505,3845,1958,1,4,924,222,803,3591,97,5,2991,789,153,1044,1,84,1,355,3
h-,10502,18077,28,20,94,45248,26,4,24,12640,0,5,163,208,238,7340,16,3,2102,244,2589,1474,10,47,3,987,10
i-,4618,2483,1038,7357,6508,4401,3101,4233,63,252,37,1085,6260,5267,34002,8238,1019,145,4347,15540,15180,189,3790,8,356,18,881






Quantitative parity table for each text:

In [13]:
for table in quantit_tables:
    print('-'*60, '\nQuantitative table for "', table, '"',sep = '')
    print('All chars:' ,quantit_tables[table].values.sum())
    display_df(quantit_tables[table], name = False)
    print('\n\n')

------------------------------------------------------------
Quantitative table for "Torah_Bereshit"
All chars: 181244


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,3859,1784,1049,974,830,1488,900,3129,2055,453,219,1206,1498,614,2158,665,8,417,2745,6038,305,92,2554,0,1263,48
a-,419,148,368,582,399,72,141,122,511,791,0,196,716,854,3120,99,152,0,1092,837,1315,295,376,136,0,436,17
b-,195,171,2,0,2,906,0,0,0,99,0,0,141,1,0,266,0,0,535,33,5,187,0,0,0,86,1
c-,78,560,0,40,0,333,0,0,523,96,0,172,36,0,0,530,0,0,64,5,16,77,0,0,0,1,0
d-,5412,357,3,0,28,391,2,22,0,267,0,0,21,32,72,312,0,0,244,159,3,34,1,18,0,21,0
e-,7376,1001,96,266,1390,425,209,153,154,287,0,49,696,376,1260,88,324,8,2357,974,408,29,297,70,33,375,15
f-,1547,386,0,0,0,240,56,0,0,191,0,0,117,0,0,639,0,0,253,18,179,52,0,0,0,1,0
g-,581,210,0,0,3,200,0,3,384,128,0,0,19,0,39,497,0,0,166,45,7,26,0,0,0,97,0
h-,1534,1701,5,2,1,5802,0,0,6,1739,0,1,9,23,0,577,7,0,94,67,344,132,2,2,0,29,1
i-,493,49,23,214,680,305,267,198,1,5,0,29,763,603,1790,132,69,3,360,1431,952,1,426,0,18,0,19







------------------------------------------------------------
Quantitative table for "Genetics_124"
All chars: 379157


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,6998,2294,4728,2586,1988,2940,2093,2081,4352,76,262,977,2630,1200,4875,3261,64,1845,3649,9104,559,332,2287,452,798,60
a-,2201,31,478,971,341,10,171,271,2,906,7,139,3288,689,4277,2,607,2,2484,1761,4115,356,348,111,13,205,6
b-,232,323,43,0,4,975,0,0,1,439,29,0,469,0,29,413,2,0,447,95,29,254,3,0,2,326,0
c-,688,1398,12,167,1,1878,49,10,2332,586,0,162,407,1,3,1690,0,3,567,160,1135,389,0,0,0,194,1
d-,5016,197,11,4,69,1681,5,31,1,1803,2,0,80,6,197,426,8,0,228,384,2,682,12,12,0,116,0
e-,12977,1557,35,1438,2678,787,182,415,42,722,5,21,1978,758,4823,176,428,208,4338,4587,1511,83,566,131,793,306,3
f-,3184,248,0,1,0,841,451,0,0,719,0,0,169,0,0,1188,16,0,472,61,99,209,0,1,0,39,0
g-,1889,522,1,6,0,2011,8,120,352,309,0,0,128,37,122,380,0,0,532,148,68,217,0,1,0,54,0
h-,1603,2366,9,0,39,6467,0,1,1,1142,0,1,39,13,56,1000,2,0,1046,26,189,351,0,21,0,223,0
i-,386,499,286,1803,987,777,602,677,41,193,0,117,972,471,6329,2396,196,26,612,3135,2173,11,965,1,35,0,279







------------------------------------------------------------
Quantitative table for "Chat_Comments"
All chars: 277917


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,5632,2349,2332,1628,1226,1906,1192,2030,4118,397,316,1443,1953,1265,2872,2101,95,1312,3243,7723,617,398,3180,72,913,40
a-,1582,7,539,627,544,81,114,456,57,472,29,285,1889,641,3514,9,269,12,1858,1355,2345,311,530,128,18,520,65
b-,119,348,20,7,3,1045,2,1,6,281,7,1,486,5,5,529,0,0,149,106,23,513,13,0,0,163,0
c-,274,1090,4,108,5,781,1,3,735,371,0,355,235,0,0,1259,2,4,303,63,507,182,2,0,0,55,0
d-,3972,251,4,14,98,1170,8,28,3,812,0,0,62,17,78,634,6,0,173,227,5,231,26,16,0,101,5
e-,9738,1321,83,721,1478,692,189,168,29,463,3,34,925,656,2087,351,298,42,3264,2376,791,58,487,210,279,573,9
f-,1415,315,0,6,3,361,248,14,0,335,0,1,66,0,0,930,4,0,310,3,88,220,0,4,0,11,1
g-,1965,322,7,4,5,807,0,46,386,231,0,0,73,17,104,492,1,5,325,123,9,137,1,1,1,39,1
h-,1033,2137,4,7,3,4569,6,0,6,1436,0,3,32,10,48,1050,3,0,92,33,261,110,0,2,0,154,0
i-,769,435,138,1046,612,560,386,442,4,7,1,246,803,651,4134,819,115,19,618,2266,2146,14,412,4,16,1,89







------------------------------------------------------------
Quantitative table for "Asimov_I_Robot"
All chars: 377896


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,7331,3087,2813,2572,1557,2362,1483,4451,5429,323,435,1608,2778,1697,4058,2082,285,2207,5139,10210,850,397,5028,5,1782,9
a-,1934,19,513,1131,874,9,123,326,38,1153,11,327,1990,483,5091,1,394,3,2028,2226,3532,230,619,216,20,608,24
b-,73,363,138,3,1,1337,0,0,0,299,20,0,595,5,7,1192,0,0,382,82,22,580,26,0,0,281,0
c-,280,1331,0,102,3,1215,0,1,1104,446,0,428,213,0,0,1192,0,4,298,77,691,347,0,0,0,49,4
d-,7613,333,5,0,156,1381,9,52,10,926,6,2,129,34,142,1132,2,1,305,195,5,115,36,10,0,132,0
e-,13402,1648,31,795,3448,1043,317,222,62,317,11,50,1771,742,2870,82,383,57,4905,2304,1095,41,588,236,366,696,6
f-,2341,387,0,1,0,449,281,0,2,728,0,0,145,0,0,949,1,0,350,5,213,232,0,3,0,15,0
g-,2188,328,0,0,1,991,0,74,824,346,0,1,276,14,97,475,0,0,379,122,19,121,0,0,0,46,0
h-,1569,3013,3,0,8,8053,9,0,1,2361,0,0,37,39,21,1013,1,0,262,46,536,363,0,1,0,199,0
i-,925,397,157,1202,1016,819,549,652,0,0,0,195,999,926,5775,1222,195,8,740,2194,3177,52,453,0,67,0,100







------------------------------------------------------------
Quantitative table for "Harry_Potter_I"
All chars: 412897


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,7633,3758,2563,2588,1241,2960,1990,8591,4179,348,665,2187,2485,1820,4008,2144,375,1719,6425,11472,938,492,5617,0,1677,16
a-,1905,24,459,920,1450,0,179,970,64,1514,7,490,1845,668,4209,2,563,0,3789,2827,2834,320,557,227,6,739,22
b-,48,561,118,0,0,1593,1,0,0,255,7,0,727,0,1,729,0,0,412,19,7,645,17,0,0,188,0
c-,96,881,0,21,1,760,0,102,1268,141,0,1000,484,2,0,1088,0,0,351,3,244,195,0,0,0,59,0
d-,10483,395,1,3,268,1189,6,73,2,817,0,5,298,19,448,1250,0,0,336,316,1,483,15,22,0,186,0
e-,14249,2195,35,493,4178,1481,315,100,248,364,1,81,1305,769,2373,178,391,5,5230,1941,1238,3,867,331,298,1225,21
f-,1957,453,2,1,0,685,556,0,2,717,0,0,412,0,0,1083,0,0,478,8,241,225,0,8,0,29,0
g-,3208,431,9,0,0,886,0,145,1097,291,0,0,226,3,44,989,0,0,933,169,9,85,0,80,0,12,2
h-,1753,4884,3,5,19,9722,6,0,8,3040,0,0,21,16,6,1628,0,0,318,25,695,277,8,8,0,89,0
i-,670,150,140,815,2045,529,427,733,0,0,0,225,1062,998,6213,573,124,2,977,2019,2505,16,378,0,55,0,159







------------------------------------------------------------
Quantitative table for "News_articles"
All chars: 98257


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,1906,775,897,616,453,880,279,661,1076,81,66,441,707,386,998,772,32,538,1047,2468,230,114,886,6,155,11
a-,564,1,165,285,229,2,73,182,5,288,11,118,641,311,1268,6,119,1,799,509,851,85,171,42,7,193,35
b-,12,139,11,8,0,299,0,0,0,117,3,0,210,1,0,141,3,0,65,19,6,159,3,0,0,111,0
c-,81,405,4,53,0,416,0,0,309,140,0,97,97,3,14,528,0,1,177,31,190,97,0,0,0,21,1
d-,1451,141,2,1,44,601,2,32,5,386,3,0,20,23,23,110,3,0,79,63,0,89,11,1,0,28,0
e-,3135,430,27,325,818,205,106,89,5,135,4,21,337,200,925,78,132,13,1309,1012,254,20,165,75,123,128,5
f-,547,189,0,0,0,190,110,0,0,196,0,0,99,0,0,334,0,0,176,2,55,42,0,0,0,2,0
g-,653,154,0,0,5,285,0,30,110,99,0,0,31,3,40,133,1,0,140,61,6,54,0,0,0,7,0
h-,388,672,2,0,1,1630,0,0,2,389,0,0,2,2,8,308,0,0,51,16,59,47,0,3,0,19,0
i-,67,265,73,502,318,239,102,193,6,1,36,44,334,195,1603,449,52,11,196,684,688,2,152,1,28,4,20







------------------------------------------------------------
Quantitative table for "Hamlet_BB"
All chars: 211012


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,4006,1857,1409,1375,828,1598,1008,2786,2918,120,416,1354,2452,1183,2297,1230,135,684,2831,5849,414,254,2724,0,1123,6
a-,940,0,169,415,454,71,101,246,9,440,25,340,889,487,2543,1,194,0,1328,1002,1929,163,459,129,16,560,14
b-,34,146,22,2,3,813,0,0,1,84,12,0,271,0,1,228,0,0,189,39,16,425,1,0,0,176,0
c-,60,450,0,38,11,678,0,0,602,137,0,233,141,1,0,756,0,4,170,11,319,126,0,0,0,31,0
d-,3911,190,0,2,46,864,9,45,2,442,1,1,44,16,43,491,0,0,137,284,3,76,19,3,0,102,0
e-,7344,1347,26,342,814,764,173,80,31,194,4,45,854,378,1837,45,187,22,2560,1471,832,58,256,102,212,222,5
f-,1331,355,0,0,3,323,147,0,0,280,0,0,94,0,0,682,1,0,246,3,120,102,0,0,0,6,0
g-,897,197,0,0,20,486,1,43,400,189,0,0,90,16,35,377,0,0,257,79,15,136,0,0,0,24,0
h-,1220,1969,2,4,19,3924,5,0,0,1565,0,0,11,9,4,1045,0,0,97,17,293,125,0,1,0,202,0
i-,702,167,49,412,283,460,287,294,1,0,0,141,745,503,2662,528,74,9,487,1699,1533,78,264,1,15,0,23







------------------------------------------------------------
Quantitative table for "Sci_articles"
All chars: 325497


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,6046,1786,3465,2342,1515,2722,1072,960,4343,323,680,1297,2106,1098,3349,3115,168,1866,5231,8606,894,1261,2097,107,279,174
a-,2017,46,606,1193,416,35,50,423,32,678,8,185,2376,986,3609,15,437,5,2025,1528,3690,211,291,24,133,303,8
b-,323,418,4,20,0,738,0,18,1,186,61,0,540,2,4,323,25,0,101,49,62,279,7,0,0,292,0
c-,815,1172,1,186,20,1140,3,67,1393,488,0,227,319,21,4,2091,12,4,285,133,1506,370,55,0,0,110,5
d-,4509,504,72,96,50,1988,88,12,0,1007,11,3,13,4,6,249,4,84,166,220,42,347,26,24,4,156,0
e-,11251,1506,33,1256,2320,794,376,349,31,238,19,9,1020,855,2727,48,373,280,4229,2966,1127,46,417,114,478,69,10
f-,2105,146,0,12,3,569,205,10,0,986,0,0,132,1,1,1362,2,0,376,11,83,332,0,0,0,66,0
g-,2331,260,17,54,2,822,3,44,292,365,1,3,81,132,322,248,95,0,259,42,20,268,0,2,0,76,0
h-,1402,1335,0,2,4,5081,0,3,0,968,0,0,12,96,95,719,3,3,142,14,212,69,0,9,3,72,9
i-,606,521,172,1363,567,712,481,1044,10,46,0,88,582,920,5496,2119,194,67,357,2112,2006,15,740,1,122,13,192







------------------------------------------------------------
Quantitative table for "All texts"
All chars: 2263891


Unnamed: 0,-,-a,-b,-c,-d,-e,-f,-g,-h,-i,-j,-k,-l,-m,-n,-o,-p,-q,-r,-s,-t,-u,-v,-w,-x,-y,-z
-,0,43411,17690,19257,14681,9638,16856,10019,24691,28471,2121,3059,10513,16610,9263,24615,15370,1162,10588,30310,61470,4807,3340,24373,642,7990,364
a-,11562,276,3297,6124,4707,280,952,2996,718,6242,98,2080,13634,5119,27631,135,2735,23,15403,12045,20611,1971,3351,1013,213,3564,191
b-,1036,2469,358,40,13,7706,3,19,9,1760,139,1,3439,14,47,3821,30,0,2280,442,170,3042,70,0,2,1623,1
c-,2372,7287,21,715,41,7201,53,183,8266,2405,0,2674,1932,28,21,9134,14,20,2215,483,4608,1783,57,0,0,520,11
d-,42369,2368,98,120,759,9265,129,295,23,6460,23,11,667,151,1009,4604,23,85,1668,1848,61,2057,146,106,4,842,5
e-,79473,11005,366,5636,17124,6191,1867,1576,602,2720,47,310,8886,4734,18902,1046,2516,635,28192,17631,7256,338,3643,1269,2582,3594,74
f-,14427,2479,2,21,9,3658,2054,24,4,4152,0,1,1234,1,1,7167,24,0,2661,111,1078,1414,0,16,0,169,1
g-,13712,2424,34,64,36,6488,12,505,3845,1958,1,4,924,222,803,3591,97,5,2991,789,153,1044,1,84,1,355,3
h-,10502,18077,28,20,94,45248,26,4,24,12640,0,5,163,208,238,7340,16,3,2102,244,2589,1474,10,47,3,987,10
i-,4618,2483,1038,7357,6508,4401,3101,4233,63,252,37,1085,6260,5267,34002,8238,1019,145,4347,15540,15180,189,3790,8,356,18,881









---
---
# Creating ***frequency*** and ***biased*** tables

We now have a dict of DataFrames that contain the pair counts for each text file and for all texts combined.

For further work, we are interested only in the table for `'All texts'`.

---
## Creating ***frequency*** tables
Now we create a dict of DataFrames that contain:
- pair counts for all texts
- pair counts for all texts without spaces ($^1$)
- pair frequencies for all texts ($^2$)
- pair frequencies for all texts without spaces ($^1$)($^2$)

---
($^1$) In addition to the standard quantitative table (letters and space), we also create a table for letters only, because the encoded text has no spaces.

($^2$) Note that these DataFrames (matrices) run into floating-point precision issues: the elements of a frequency (normalized) matrix may sum not to exactly 1 but to, for example, 1.0000000000000002, that is, they differ from 1 by roughly 1e-15.
The test corpus here is not very large (2263891 characters), so this should not affect the result. With a frequency table built from a much larger set of texts, problems might appear (though not necessarily). [See `Example 1` at the end of the notebook]
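
The precision note above can be checked with a tolerance instead of an exact comparison. The following is a minimal sketch on a toy count matrix (the values are invented for illustration and are not the corpus data); it normalizes the counts and tests the sum with `math.isclose`.

```python
import math
import numpy as np

# Toy 27x27 count matrix; the values are illustrative only
counts = np.arange(1, 27 * 27 + 1).reshape(27, 27)

freq = counts / counts.sum()      # normalize to frequencies
total = freq.sum()

print(total)                                      # may differ from 1 in the last digits
print(math.isclose(total, 1.0, rel_tol=1e-12))    # tolerant comparison
```

Comparing with a tolerance avoids treating a deviation in the last few bits as an error in the table.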

In [14]:
freq_tables = {}

# DF for values of number pairs for all texts
freq_tables['Quantit for All'] = quantit_tables['All texts'].copy()
freq_tables['Quantit for All'].name = 'Quantit for All'

# DF for values of number pairs for all texts without spaces
freq_tables['Quantit without spaces'] = freq_tables['Quantit for All'].iloc[1:, 1:]
freq_tables['Quantit without spaces'].name = 'Quantit without spaces'

# DF for frequency of number pairs for all texts
freq_tables['Freq for All'] = freq_tables['Quantit for All'] / freq_tables['Quantit for All'].values.sum()
freq_tables['Freq for All'].name = 'Freq for All'

# DF for frequency of number pairs for all texts without spaces
freq_tables['Freq without spaces'] = freq_tables['Quantit without spaces'] / freq_tables['Quantit without spaces'].values.sum()
freq_tables['Freq without spaces'].name = 'Freq without spaces'


Save DataFrames with `quantitative` and `frequency` tables for `All texts` to disk (Google Drive)

In [15]:
for name in freq_tables:
    path_pt = os.path.join(drch, folder_pt, name + '.csv')
    freq_tables[name].to_csv(path_pt, index_label=False)
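
Since later steps read these tables back from disk, here is a small round-trip sketch. It writes a toy table with the same `to_csv(..., index_label=False)` call and reads it back with `pd.read_csv`. The in-memory `io.StringIO` buffer stands in for a real file and the toy values are assumptions for the example.

```python
import io
import pandas as pd

table = pd.DataFrame([[1, 2], [3, 4]],
                     index=['a-', 'b-'], columns=['-a', '-b'])

buffer = io.StringIO()                     # stands in for a file on disk
table.to_csv(buffer, index_label=False)    # same call as in the cell above
buffer.seek(0)

# The header has one field fewer than the data rows, so read_csv
# uses the first column as the index without index_col being set.
loaded = pd.read_csv(buffer)
print(loaded)
```

Note that the `name` attribute set on each DataFrame is not written to the CSV, so it would have to be restored from the file name after loading.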

---
## Creating ***biased*** tables

Now we create a dict of ***biased tables***.
A biased table is a table with a bias applied: every value equal to 0 is replaced with 1.

As the tables show, letter pairs with small counts (1-3) look rather "odd". They come from some anomaly in the text: either a word of unusual origin or an error. We can therefore assume that every such "anomaly" occurred at least once, that is, replace counts of 0 with 1.

We now create the following DataFrames, which hold the `biased tables` derived from the `frequency tables`, namely:
- biased quantitative table for all texts
- biased quantitative table for all texts without spaces
- biased frequency table for all texts
- biased frequency table for all texts without spaces
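
The effect of the bias can be seen on a toy table. This is a minimal sketch with invented counts: `DataFrame.replace(0, 1)` changes only the zero cells, and the biased frequencies are then normalized by the new total.

```python
import pandas as pd

# Toy counts, invented for illustration
counts = pd.DataFrame([[0, 5], [2, 0]],
                      index=['a-', 'b-'], columns=['-a', '-b'])

biased = counts.replace(0, 1)                 # only zero cells change
biased_freq = biased / biased.values.sum()    # normalize by the biased total

print(biased)
print(biased_freq)
```

This differs from add-one (Laplace) smoothing, which would add 1 to every cell rather than only to the zero cells.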

In [16]:
biased_tables = {}

# DF for biased values (BVal) of number pairs for all texts: zeros replaced with ones
biased_tables['BQuant for All'] = (quantit_tables['All texts'].copy()).replace(0, 1)
biased_tables['BQuant for All'].name = 'BQuant for All'

# DF for biased values of number pairs for all texts without spaces
biased_tables['BQuant without spaces'] = biased_tables['BQuant for All'].iloc[1:, 1:]
biased_tables['BQuant without spaces'].name = 'BQuant without spaces'

# DF for biased frequency of number pairs for all texts
biased_tables['BFreq for All'] = biased_tables['BQuant for All'] / biased_tables['BQuant for All'].values.sum()
biased_tables['BFreq for All'].name = 'BFreq for All'

# DF for biased frequency of number pairs for all texts without spaces
biased_tables['BFreq without spaces'] = biased_tables['BQuant without spaces'] / biased_tables['BQuant without spaces'].values.sum()
biased_tables['BFreq without spaces'].name = 'BFreq without spaces'

Save these DataFrames with `biased tables` to disk.

In [17]:
for name in biased_tables:
    path_pt = os.path.join(drch, folder_pt, name + '.csv')
    biased_tables[name].to_csv(path_pt, index_label=False)

## Creating ***logarithmic*** tables
The biased tables contain values that, when multiplied together many times, give a result too large (quantitative) or too small (frequency) for the computer to represent. To avoid such overflow or underflow, we take the logarithm (base `10`) of every value, so that a product of values becomes a sum of their logarithms.

We now create the following DataFrames, which hold the `logarithmic tables` derived from the `biased tables`, namely:
- logarithmic biased quantitative table for all texts
- logarithmic biased quantitative table for all texts without spaces
- logarithmic biased frequency table for all texts
- logarithmic biased frequency table for all texts without spaces
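
The motivation can be illustrated with a few lines of plain Python. In this sketch the frequency value and the number of pairs are invented for the example: multiplying a small frequency many times underflows to zero, while summing its base-10 logarithms stays well within floating-point range.

```python
import math

freq = 1e-5        # illustrative pair frequency
n_pairs = 100      # illustrative number of pairs in a message

product = 1.0
for _ in range(n_pairs):
    product *= freq                # underflows before the loop ends

log_score = sum(math.log10(freq) for _ in range(n_pairs))

print(product)      # the true value 1e-500 is below the float range
print(log_score)    # the same quantity on a log10 scale
```

Comparing candidate decodings by their summed logarithms preserves the ordering that the products would have given, since the logarithm is monotonic.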

In [18]:
logarithmic_tables = {}

# DF for logarithmic biased values (Log BVal) of number pairs for all texts
logarithmic_tables['Log BQuant for All'] = biased_tables['BQuant for All'].applymap(np.log10)
logarithmic_tables['Log BQuant for All'].name = 'Log BQuant for All'

# DF for logarithmic biased values of number pairs for all texts without spaces
logarithmic_tables['Log BQuant without spaces'] = biased_tables['BQuant without spaces'].applymap(np.log10)
logarithmic_tables['Log BQuant without spaces'].name = 'Log BQuant without spaces'

# DF for logarithmic biased frequency of number pairs for all texts
logarithmic_tables['Log BFreq for All'] = biased_tables['BFreq for All'].applymap(np.log10)
logarithmic_tables['Log BFreq for All'].name = 'Log BFreq for All'

# DF for logarithmic biased frequency of number pairs for all texts without spaces
logarithmic_tables['Log BFreq without spaces'] = biased_tables['BFreq without spaces'].applymap(np.log10)
logarithmic_tables['Log BFreq without spaces'].name = 'Log BFreq without spaces'

In [19]:
for name in logarithmic_tables:
    path_pt = os.path.join(drch, folder_pt, name + '.csv')
    logarithmic_tables[name].to_csv(path_pt, index_label=False)

---
---
---