# Preprocessing for **letter pairs**

## About
The task is to decode "encrypted" text using the number of occurrences of letter pairs in natural-language text.
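
As a rough illustration of what "occurrences of letter pairs" means, the pair counts can be sketched as below. The function name `count_letter_pairs` is hypothetical and is not part of this notebook; the sketch assumes the text has already been preprocessed.

```python
from collections import Counter

def count_letter_pairs(text):
    """Count adjacent character pairs in an already preprocessed text."""
    # zip(text, text[1:]) yields every pair of neighbouring characters
    return Counter(a + b for a, b in zip(text, text[1:]))

pairs = count_letter_pairs("the cat and the hat")
print(pairs["th"], pairs["at"])   # both pairs occur twice in this sample
print(pairs.most_common(3))
```

Spaces are treated as ordinary characters here, so pairs such as `"e "` are counted as well.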

This notebook preprocesses the texts so that they meet the requirements of the task, namely:
- the text contains only lowercase letters and spaces;
- all words are written consecutively on one line, with no paragraph breaks (\n) or other hidden characters.

Additional processing may be needed, depending on the properties of the particular type of text.
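
The two requirements above can be sketched as one small function. The name `preprocess` is an assumption made for illustration; the cells further down apply the same three steps inline rather than through a helper.

```python
import re

def preprocess(text):
    """Reduce a text to lowercase a-z and single spaces, all on one line."""
    text = text.lower()                    # lowercase letters only
    text = re.sub(r'[^a-z ]', ' ', text)   # everything else (including \n and \t) becomes a space
    text = re.sub(r' +', ' ', text)        # collapse runs of spaces
    return text

sample = "In the beginning,\nGod created\tthe heavens!"
print(repr(preprocess(sample)))
```

A leading or trailing space can remain (here the final `!` becomes a space), because the steps do not strip the result.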

---



The following texts were used:
- (1 HS) "Hamlet" by Shakespeare ([here](https://shakespearestudyguide.com/Hamlet%20Text.html))
- (2 TB) Torah, Sefer Bereshit in English ([here](https://www.tanach.us/Pages/About.html))
- (3 AR) Asimov "I, Robot"  ([here](https://royallib.com/book/Asimov_Isaac/I_Robot.html))
- (4 HP) "Harry Potter and the Philosopher's Stone" ([here](https://github.com/amephraim/nlp/blob/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt))
- (5 GG) Genetics : From Genes to Genomes. 6th edition ([here](http://skgjx.whu.edu.cn/Public/upfile/article/202103031656469260.pdf)) [Intro & parts 1,2,4. My edit from PDF]
- (6 SA) Scientific articles: from IEEE, [medium.com](https://medium.com/) etc.
- (7 NA) News articles: [New York Times](https://www.nytimes.com/international/), [Fox News](https://www.foxnews.com/), [BBC](https://www.bbc.com/news) etc.
- (8 CC) Comments in chats








---
---
## Import & mount

In [1]:
import os
import re
import copy
from time import time
import numpy as np

# from matplotlib import pyplot as plt
# plt.rcParams['figure.figsize'] = [15, 6]

In [2]:
# Mount
from google.colab import drive
drive.mount('/content/drive')
# Path to data
drch = '/content/drive/MyDrive/DS small/Letters pairs/'
folder = 'Data/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


---
---
## Functions

#### Space_n_letters()
Prints the total number of characters, spaces, and letters in the preprocessed text.

In [3]:
def Space_n_letters(pp_text):
    print('Symbols number:',len(pp_text))
    print('Spaces number:',pp_text.count(' '))
    print('Only letters: ', len(pp_text) - pp_text.count(' '))
    print('Only letters (2): ', len(list(filter(str.isalpha, pp_text))),'\n')

#### CharacterCounts()
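Counts the occurrences of each character in the text and returns them as a dictionary sorted by decreasing count.

Note: optional behaviour, controlled by the arguments:
* `uppercase=False` (default): upper-case letters are counted together with lower-case ones;
* `multispace=False` (default): runs of spaces are collapsed into a single space.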

In [4]:
# Returns a dictionary {character: count}, sorted from the largest count to the smallest
def CharacterCounts(orig_text, multispace=False, uppercase=False):
    text = orig_text                                      # strings are immutable, so no copy is needed
    if not multispace: text = re.sub(r' +', ' ', text)    # collapse runs of spaces
    if not uppercase:  text = text.lower()                # count upper- and lower-case together

    counts = {}                                           # character -> number of occurrences
    for char in text:
        counts[char] = counts.get(char, 0) + 1
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
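
For comparison, the same counting can be sketched with `collections.Counter` from the standard library. The name `character_counts` is hypothetical; this is an alternative under the same two options, not the function used in the cells below.

```python
import re
from collections import Counter

def character_counts(text, multispace=False, uppercase=False):
    """Return {character: count}, most frequent first (Counter-based variant)."""
    if not multispace:
        text = re.sub(r' +', ' ', text)   # collapse runs of spaces before counting
    if not uppercase:
        text = text.lower()               # count 'A' and 'a' together
    # most_common() sorts by decreasing count; ties keep first-seen order
    return dict(Counter(text).most_common())

print(character_counts("Aa  bB a"))
print(character_counts("Aa  bB a", uppercase=True))
```

With the defaults, the sample is reduced to `"aa bb a"` before counting, so `a` is counted three times and the space twice.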

---
---
## Hamlet

### Hamlet 1-50

For more detail on the preprocessing of this part of the text, see [Letters_pairs_preproc_Hamlet.ipynb](https://colab.research.google.com/drive/1nMsjBDBKoNnbY7QjcW4tTk2ZPfr7kBvM#scrollTo=L_FEXkUE8_1n)


### All "Hamlet"

In [5]:
file_name_HS = 'Hamlet.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name_HS)

with open(path, 'r') as file:
    orig_txt_HS = file.read()

In [6]:
# Preprocessing
pp_text = copy.copy(orig_txt_HS)

pp_text = re.sub(r'\b[A-Z]+\b:', '', pp_text)   # Regular expression for "BERNARDO:" etc.

pp_text = pp_text.lower()                       # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)      # only letters and spaces
pp_text = re.sub(r' +', ' ', pp_text)           # collapse multiple spaces
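
A small check of the speaker-label pattern on a made-up sample. The sample lines and the second, multi-word pattern are assumptions for illustration; whether the source file contains multi-word labels in this form has not been verified here.

```python
import re

sample = "BERNARDO: Who's there?\nFIRST CLOWN: Is she to be buried?"

# Pattern used in the cell above: one upper-case word followed by a colon
single_word = re.sub(r'\b[A-Z]+\b:', '', sample)

# Hypothetical variant: an upper-case label of one or more words at the start of a line
multi_word = re.sub(r'^[A-Z][A-Z ]*:', '', sample, flags=re.MULTILINE)

print(repr(single_word))
print(repr(multi_word))
```

On this sample the single-word pattern removes `CLOWN:` but leaves `FIRST` in the text, while the line-anchored variant removes the whole label.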

In [7]:
# Write to file text after preprocessing
path = os.path.join(drch, re.sub(r'\.txt', '_pp.txt', file_name_HS))

with open(path, 'w') as file:
    file.write(pp_text)
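
The output name above is built with `re.sub(r'\.txt', '_pp.txt', ...)`, which replaces every `.txt` occurrence in the name. A possible alternative, sketched with `os.path.splitext`, touches only the extension. The helper name `pp_path` is hypothetical, and the demonstration writes to a temporary directory rather than to Google Drive.

```python
import os
import tempfile

def pp_path(base_dir, file_name):
    """Build the output path '<name>_pp<ext>' inside base_dir."""
    stem, ext = os.path.splitext(file_name)   # 'Hamlet.txt' -> ('Hamlet', '.txt')
    return os.path.join(base_dir, stem + '_pp' + ext)

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out = pp_path(tmp, 'Hamlet.txt')
    with open(out, 'w', encoding='utf-8') as file:
        file.write('who s there ')
    written_name = os.path.basename(out)

print(written_name)
```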

In [None]:
# For stat:
Space_n_letters(pp_text)

fs_text = copy.copy(orig_txt_HS)
fs_text = re.sub(r'\b[A-Z]+\b:', '', fs_text) # only for "Hamlet" (look above)
CharacterCounts(fs_text)

---
---
## Torah. Sefer Bereshit

#### Only parasha Bereshit

In [9]:
file_name = 'Torah_Bereshit_Bereshit.txt'
folder = 'Data_add/'
path = os.path.join(drch, folder, file_name)

with open(path, 'r') as file:
    orig_txt = file.read()

Comment 1: the `'s` pattern at the end of words is removed.
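
The effect of removing `'s` first can be seen on a short sample. The helper `to_letters_and_spaces` and the sample sentence are assumptions for illustration; the helper only bundles the same substitutions that the cells apply inline.

```python
import re

def to_letters_and_spaces(text):
    # Same steps as in the cells: lowercase, non-letters become spaces, spaces collapse
    text = re.sub(r'[^a-z ]', ' ', text.lower())
    return re.sub(r' +', ' ', text)

sample = "God's wind was hovering; I didn't like them."

kept    = to_letters_and_spaces(sample)                     # apostrophe becomes a space
removed = to_letters_and_spaces(re.sub("'s", '', sample))   # `'s` dropped first

print(repr(kept))
print(repr(removed))
```

Without the removal, the possessive leaves a stray one-letter word `s`. Other contractions such as `didn't` still split into `didn t` in both cases.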

In [10]:
pp_text = copy.copy(orig_txt)

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

print(pp_text)

genesis parashat bereishit the seven days of creation in the beginning god created the heavens and the earth now the earth was formless and empty with darkness over the surface of the deep and god wind was hovering over the surface of the waters god said let there be light and there was light god saw that the light was good and god divided the light from the darkness god called the light day and the darkness he called night there was evening and there was morning a first day p god said let there be an expanse in the middle of the waters and let it divide water from water god made the expanse and divided the water which was under the expanse from the water which was above the expanse and it was so god called the expanse heaven there was evening and there was morning a second day p god said let the waters under the sky be gathered to one place and let the dry land appear and it was so god called the dry land earth and the gathering of the waters he called seas and god saw that it was goo

#### Bereshit

In [11]:
file_name_TB = 'Torah_Bereshit.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name_TB)

with open(path, 'r') as file:
    orig_txt_TB = file.read()

In [12]:
pp_text = copy.copy(orig_txt_TB)

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

# Space_n_letters(pp_text)

In [13]:
# Write to file text after preprocessing
path = os.path.join(drch, re.sub(r'\.txt', '_pp.txt', file_name_TB))

with open(path, 'w') as file:
    file.write(pp_text)

In [None]:
# For stat:
Space_n_letters(pp_text)

fs_text = copy.copy(orig_txt_TB)
fs_text = re.sub("'s", '', fs_text)         # remove `'s`

CharacterCounts(fs_text)

---
---
## Asimov "I, Robot"

#### Only "Introduction"

In [15]:
file_name = 'Asimov_I_Robot_Intro.txt'
folder = 'Data_add/'
path = os.path.join(drch, folder, file_name)

with open(path, 'r', errors='ignore') as file:
    orig_txt = file.read()
# print(orig_txt)

In [16]:
pp_text = copy.copy(orig_txt)

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

print(pp_text)

isaac asimov i robot introduction i looked at my notes and i didn t like them i d spent three days at u s robots and might as well have spent them at home with the encyclopedia tellurica susan calvin had been born in the year they said which made her seventy five now everyone knew that appropriately enough u s robot and mechanical men inc was seventy five also since it had been in the year of dr calvin birth that lawrence robertson had first taken out incorporation papers for what eventually became the strangest industrial giant in man history well everyone knew that too at the age of twenty susan calvin had been part of the particular psycho math seminar at which dr alfred lanning of u s robots had demonstrated the first mobile robot to be equipped with a voice it was a large clumsy unbeautiful robot smelling of machine oil and destined for the projected mines on mercury but it could speak and make sense susan said nothing at that seminar took no part in the hectic discussion period t

#### "I, Robot", all text

In [17]:
file_name_AR = 'Asimov_I_Robot.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name_AR)

with open(path, 'r',  errors='ignore') as file:
    orig_txt_AR = file.read()

In [18]:
pp_text = copy.copy(orig_txt_AR)

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

In [19]:
# Write to file text after preprocessing
path = os.path.join(drch, re.sub(r'\.txt', '_pp.txt', file_name_AR))

with open(path, 'w') as file:
    file.write(pp_text)

In [None]:
Space_n_letters(pp_text)

fs_text = copy.copy(orig_txt_AR)
fs_text = re.sub("'s", '', fs_text)         # remove `'s`

CharacterCounts(fs_text)

---
---
## Harry Potter

In [44]:
file_name_HP = 'Harry_Potter_I.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name_HP)

with open(path, 'r',  errors='ignore') as file:
    orig_txt_HP = file.read()

# print(orig_txt_HP[:1000])

In [45]:
pp_text = copy.copy(orig_txt_HP)

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

In [46]:
# Write to file text after preprocessing
path = os.path.join(drch, re.sub(r'\.txt', '_pp.txt', file_name_HP))

with open(path, 'w') as file:
    file.write(pp_text)

In [47]:
Space_n_letters(pp_text)

fs_text = copy.copy(orig_txt_HP)
fs_text = re.sub("'s", '', fs_text)         # remove `'s`

CharacterCounts(fs_text)

Symbols number: 413630
Spaces number: 79624
Only letters:  334006
Only letters (2):  334006 



{' ': 70803,
 'e': 39915,
 't': 29048,
 'a': 26590,
 'o': 26141,
 'h': 22531,
 'n': 21825,
 'r': 21650,
 'i': 20815,
 's': 18713,
 'd': 16617,
 'l': 14594,
 '\n': 10702,
 'u': 9755,
 'y': 8619,
 'g': 8619,
 'w': 8397,
 'm': 7394,
 'f': 6857,
 'c': 6696,
 '.': 6136,
 ',': 5658,
 'p': 5548,
 'b': 5328,
 '"': 4758,
 'k': 4009,
 'v': 2908,
 "'": 2140,
 '-': 1990,
 '?': 754,
 '!': 474,
 'q': 420,
 'x': 383,
 'j': 370,
 'z': 264,
 ';': 135,
 ':': 69,
 ')': 33,
 '(': 30,
 '1': 11,
 '3': 8,
 '4': 6,
 '0': 5,
 '7': 4,
 '9': 4,
 '2': 3,
 '\t': 3,
 '5': 2,
 '*': 2,
 '~': 1,
 '8': 1,
 '6': 1,
 '\\': 1}

---
---
## Genetics : From Genes to Genomes



In [52]:
file_name_GG = 'Genetics_124.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name_GG)

with open(path, 'r',  errors='ignore') as file:
    orig_txt_GG = file.read()

# print(orig_txt_GG[-10000:])

In [53]:
pp_text = copy.copy(orig_txt_GG)

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

In [54]:
# Write to file text after preprocessing
path = os.path.join(drch, re.sub(r'\.txt', '_pp.txt', file_name_GG))

with open(path, 'w') as file:
    file.write(pp_text)

In [None]:
Space_n_letters(pp_text)

fs_text = copy.copy(orig_txt_GG)
fs_text = re.sub("'s", '', fs_text)         # remove `'s`

CharacterCounts(fs_text)

---
---
## Scientific articles

In [96]:
file_name_SA = 'Sci_articles.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name_SA)

with open(path, 'r') as file:
    orig_txt_SA = file.read()

# print(orig_txt_SA[:10000])

In [97]:
pp_text = copy.copy(orig_txt_SA)            # use the scientific-articles text, not the Genetics text

pp_text = re.sub("'s", '', pp_text)         # remove `'s`

pp_text = pp_text.lower()                   # to lower letters
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

In [98]:
# Write to file text after preprocessing
path = os.path.join(drch, re.sub(r'\.txt', '_pp.txt', file_name_SA))

with open(path, 'w') as file:
    file.write(pp_text)

In [None]:
Space_n_letters(pp_text)

fs_text = copy.copy(orig_txt_SA)
print(CharacterCounts(fs_text))
print(fs_text.count("'s"))
print('-------------------------------')

fs_text = copy.copy(orig_txt_SA)
fs_text = re.sub("'s", '', fs_text)         # remove `'s`
print(fs_text.count("'s"))
CharacterCounts(fs_text)

In [105]:
fs_text = copy.copy(orig_txt_SA)
CharacterCounts(fs_text)
print(fs_text.count("’s"))
print(fs_text.count("'s"))
print(fs_text.count("‘s"),'\n')

print(fs_text.count("’"))
print(fs_text.count("'"))
print(fs_text.count("‘"),'\n')

print(fs_text.count("’s "))
print(fs_text.count("'s "))
print(fs_text.count("‘s "),'\n')

fs_text = re.sub("'s", '', fs_text)         # remove `'s`
print(fs_text.count("’s"))
print(fs_text.count("'s"))
print(fs_text.count("‘s"),'\n')

print(fs_text.count("’"))
print(fs_text.count("'"))
print(fs_text.count("‘"),'\n')

93
14
0 

295
102
143 

89
13
0 

93
0
0 

295
88
143 
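
The counts above show that the typographic `’s` is still present after the straight `'s` has been removed, so each of those cases would leave a stray `s` after the letters-and-spaces step. One possible way to handle this, sketched under the assumption that curly apostrophes should be treated like the straight one, is to normalise them first. The helper name `remove_possessive_s` is hypothetical.

```python
import re

def remove_possessive_s(text):
    """Drop `'s` after mapping curly apostrophes to the straight one."""
    # \u2019 is the right single quote, \u2018 the left single quote
    text = text.replace('\u2019', "'").replace('\u2018', "'")
    return re.sub("'s", '', text)

sample = "the model\u2019s output and the author's claim"
print(remove_possessive_s(sample))
```

Because the left quote `‘` is also used as an opening quotation mark, mapping it to `'` could remove an `s` at the start of a quoted word; whether that matters depends on the text.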

