# Preprocessing for **letter pairs** on "Hamlet"

## About
This notebook preprocesses text for the problem of decoding "encrypted" text, using the text of "Hamlet" as an example.
It shows each stage of the text processing step by step.

The preprocessing is described in more detail in [Letters_pairs_preproc.ipynb](https://colab.research.google.com/drive/1eMOipvAFw1HhODPlyZQVYmTIZJawr8wt?hl=ua#scrollTo=L_FEXkUE8_1n&uniqifier=2).


---
---
## Import & mount

In [1]:
import os
import re
import copy
from time import time
import numpy as np


In [2]:
# Get the data from GitHub
!git clone https://github.com/EdwardGerman/Columnar-Transposition-Cipher.git  # clone repository
%ls  # checking whether all files are present
drch = '/content/Columnar-Transposition-Cipher' # Path to data

folder = 'Data/'

Mounted at /content/drive


---
---
## Functions

#### Space_n_letters()
Prints the total number of symbols, the number of spaces and the number of letters in the preprocessed text.

In [3]:
def Space_n_letters(pp_text):
    print('Symbols number:',len(pp_text))
    print('Spaces number:',pp_text.count(' '))
    print('Only letters: ', len(pp_text) - pp_text.count(' '))
    print('Only letters (2): ', len(list(filter(str.isalpha, pp_text))))
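
The two "Only letters" lines agree only when the text contains nothing but letters and spaces. A minimal sketch of a variant that returns the numbers instead of printing them makes this visible; the name `space_and_letter_counts` and the sample string are hypothetical, not part of the notebook.

```python
def space_and_letter_counts(pp_text):
    """Return (total symbols, spaces, non-space symbols, alphabetic symbols)."""
    total = len(pp_text)
    spaces = pp_text.count(' ')
    letters = sum(ch.isalpha() for ch in pp_text)  # counts alphabetic characters only
    return total, spaces, total - spaces, letters

# Made-up sample that still contains a '.', as after the step-by-step cleaning
sample = 'act scene a room in the castle.'
counts = space_and_letter_counts(sample)
print(counts)
```

Here the third and fourth values differ by one because the final `.` is a non-space symbol but not a letter.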

#### CharacterCounts()
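Counts the occurrences of each character in the text and returns a dictionary sorted by decreasing count.

Notes (optional behaviour, applied by default):
* Uppercase letters are counted as lowercase (`uppercase = False`)
* Runs of spaces are collapsed into a single space (`multyspace = False`)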

In [4]:
# Returns sorted dictionary {letter : count} from largest to smallest number
def CharacterCounts(orig_text, multyspace = False, uppercase = False):
    text = copy.copy(orig_text)
    if multyspace == False: text = re.sub(r' +', ' ', text)
    if uppercase  == False: text = text.lower()

    counts = {}   # Creating empty dictionary
    for char in text:    # Loop for count characters
        if char in counts: counts[char] += 1
        else:              counts[char]  = 1
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
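
The same counting can be sketched with `collections.Counter` from the standard library. This is an alternative under assumptions, not the notebook's function: the name `character_counts` and the parameters `keep_multispace` and `keep_uppercase` are hypothetical.

```python
import re
from collections import Counter

def character_counts(text, keep_multispace=False, keep_uppercase=False):
    """Return {character: count}, ordered from most to least frequent."""
    if not keep_multispace:
        text = re.sub(r' +', ' ', text)   # collapse runs of spaces
    if not keep_uppercase:
        text = text.lower()               # count 'A' and 'a' together
    # most_common() sorts by count, largest first
    return dict(Counter(text).most_common())

result = character_counts('To be,  or not to be')
print(result)
```

Strings are immutable in Python, so no copy of the input is needed before modifying `text` inside the function.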

## Hamlet

### Hamlet 1-50

In [5]:
folder = 'Data_add/'
file_name = 'hamlet_1_A1_Sc1_1_50.txt'
path = os.path.join(drch, folder, file_name)

In [6]:
# file_list = os.listdir(drch)
# print(file_list)

In [7]:
with open(path, 'r') as file:
    orig_txt = file.read()

print(orig_txt)

Hamlet, Prince of Denmark
Complete Text With Definitions of Difficult Words and Explanations of Difficult Passages
 
Annotated by Michael J. Cummings


Home Page: Shakespeare Index             The Hamlet Study Guide






Introduction


The following version of Hamlet is based on the text in the authoritative 1914 Oxford Edition of Shakespeare's works, edited by W. J. Craig. The text numbers the lines, including those with stage directions such as "Enter" and "Exit." Annotations (notes and definitions) in the text of the play appear in brackets in boldfaced type.


Characters


Act 1, Scene 1: Elsinore. A platform before the castle. [A floor surrounded by battlements]
Act 1, Scene 2: A room of state in the castle.
Act 1, Scene 3: A room in the house of Polonius.
Act 1, Scene 4: A platform before the castle.
Act 1, Scene 5: Another part of the platform.


Act 2, Scene 1: A room in the house of Polonius.
Act 2, Scene 2: A room in the castle.


Act 3, Scene 1: A room in the castle.
Act 3,

#### Remove all excess
`pp_text` is the variable that holds the preprocessed text.

Remove speaker labels such as "`BERNARDO:`":

In [8]:
pattern = r'\b[A-Z]+\b:'  # Regular expression for "BERNARDO:" etc.
pp_text = re.sub(pattern, '', orig_txt)

print(pp_text)

Hamlet, Prince of Denmark
Complete Text With Definitions of Difficult Words and Explanations of Difficult Passages
 
Annotated by Michael J. Cummings


Home Page: Shakespeare Index             The Hamlet Study Guide






Introduction


The following version of Hamlet is based on the text in the authoritative 1914 Oxford Edition of Shakespeare's works, edited by W. J. Craig. The text numbers the lines, including those with stage directions such as "Enter" and "Exit." Annotations (notes and definitions) in the text of the play appear in brackets in boldfaced type.


Characters


Act 1, Scene 1: Elsinore. A platform before the castle. [A floor surrounded by battlements]
Act 1, Scene 2: A room of state in the castle.
Act 1, Scene 3: A room in the house of Polonius.
Act 1, Scene 4: A platform before the castle.
Act 1, Scene 5: Another part of the platform.


Act 2, Scene 1: A room in the house of Polonius.
Act 2, Scene 2: A room in the castle.


Act 3, Scene 1: A room in the castle.
Act 3,
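
The visible part of the output contains no all-caps labels, so it looks unchanged. A short sketch on a made-up sample (the dialogue lines are illustrative, not read from the data file) shows what the pattern does and does not match:

```python
import re

pattern = r'\b[A-Z]+\b:'  # a whole word in capitals followed by ':'

# Made-up sample mixing speaker labels with header-like lines
sample = (
    "BERNARDO: Who's there?\n"
    "FRANCISCO: Nay, answer me.\n"
    "Act 1, Scene 1: Elsinore.\n"
    "Home Page: Shakespeare Index"
)
cleaned = re.sub(pattern, '', sample)
print(cleaned)
```

`Scene 1:` and `Home Page:` survive because the word before the colon is not written entirely in capital letters. The space that followed each removed label stays in the text and is only collapsed by the later multispace step.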

Replace `!`, `?` and `;` with `.`:

In [9]:
pp_text = re.sub(r'[!?;]', '.', pp_text)
print(pp_text)

Hamlet, Prince of Denmark
Complete Text With Definitions of Difficult Words and Explanations of Difficult Passages
 
Annotated by Michael J. Cummings


Home Page: Shakespeare Index             The Hamlet Study Guide






Introduction


The following version of Hamlet is based on the text in the authoritative 1914 Oxford Edition of Shakespeare's works, edited by W. J. Craig. The text numbers the lines, including those with stage directions such as "Enter" and "Exit." Annotations (notes and definitions) in the text of the play appear in brackets in boldfaced type.


Characters


Act 1, Scene 1: Elsinore. A platform before the castle. [A floor surrounded by battlements]
Act 1, Scene 2: A room of state in the castle.
Act 1, Scene 3: A room in the house of Polonius.
Act 1, Scene 4: A platform before the castle.
Act 1, Scene 5: Another part of the platform.


Act 2, Scene 1: A room in the house of Polonius.
Act 2, Scene 2: A room in the castle.


Act 3, Scene 1: A room in the castle.
Act 3,

Replace `\n` with a space:

In [10]:
pp_text = pp_text.replace('\n', ' ')
print(pp_text)

Hamlet, Prince of Denmark Complete Text With Definitions of Difficult Words and Explanations of Difficult Passages   Annotated by Michael J. Cummings   Home Page: Shakespeare Index             The Hamlet Study Guide       Introduction   The following version of Hamlet is based on the text in the authoritative 1914 Oxford Edition of Shakespeare's works, edited by W. J. Craig. The text numbers the lines, including those with stage directions such as "Enter" and "Exit." Annotations (notes and definitions) in the text of the play appear in brackets in boldfaced type.   Characters   Act 1, Scene 1: Elsinore. A platform before the castle. [A floor surrounded by battlements] Act 1, Scene 2: A room of state in the castle. Act 1, Scene 3: A room in the house of Polonius. Act 1, Scene 4: A platform before the castle. Act 1, Scene 5: Another part of the platform.   Act 2, Scene 1: A room in the house of Polonius. Act 2, Scene 2: A room in the castle.   Act 3, Scene 1: A room in the castle. Act 3,

Convert all letters to lowercase:

In [11]:
pp_text = pp_text.lower()
print(pp_text)

hamlet, prince of denmark complete text with definitions of difficult words and explanations of difficult passages   annotated by michael j. cummings   home page: shakespeare index             the hamlet study guide       introduction   the following version of hamlet is based on the text in the authoritative 1914 oxford edition of shakespeare's works, edited by w. j. craig. the text numbers the lines, including those with stage directions such as "enter" and "exit." annotations (notes and definitions) in the text of the play appear in brackets in boldfaced type.   characters   act 1, scene 1: elsinore. a platform before the castle. [a floor surrounded by battlements] act 1, scene 2: a room of state in the castle. act 1, scene 3: a room in the house of polonius. act 1, scene 4: a platform before the castle. act 1, scene 5: another part of the platform.   act 2, scene 1: a room in the house of polonius. act 2, scene 2: a room in the castle.   act 3, scene 1: a room in the castle. act 3,

Keep only letters, spaces and `.`:

In [12]:
pp_text = re.sub(r'[^a-z .]', '', pp_text)
print(pp_text)

hamlet prince of denmark complete text with definitions of difficult words and explanations of difficult passages   annotated by michael j. cummings   home page shakespeare index             the hamlet study guide       introduction   the following version of hamlet is based on the text in the authoritative  oxford edition of shakespeares works edited by w. j. craig. the text numbers the lines including those with stage directions such as enter and exit. annotations notes and definitions in the text of the play appear in brackets in boldfaced type.   characters   act  scene  elsinore. a platform before the castle. a floor surrounded by battlements act  scene  a room of state in the castle. act  scene  a room in the house of polonius. act  scene  a platform before the castle. act  scene  another part of the platform.   act  scene  a room in the house of polonius. act  scene  a room in the castle.   act  scene  a room in the castle. act  scene  a hall in the castle. act  scene  a room in

Collapse multiple spaces into a single space:

In [13]:
pp_text = re.sub(r' +', ' ', pp_text)
print(pp_text)

hamlet prince of denmark complete text with definitions of difficult words and explanations of difficult passages annotated by michael j. cummings home page shakespeare index the hamlet study guide introduction the following version of hamlet is based on the text in the authoritative oxford edition of shakespeares works edited by w. j. craig. the text numbers the lines including those with stage directions such as enter and exit. annotations notes and definitions in the text of the play appear in brackets in boldfaced type. characters act scene elsinore. a platform before the castle. a floor surrounded by battlements act scene a room of state in the castle. act scene a room in the house of polonius. act scene a platform before the castle. act scene another part of the platform. act scene a room in the house of polonius. act scene a room in the castle. act scene a room in the castle. act scene a hall in the castle. act scene a room in the castle. act scene the queens apartment. act scen
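
The steps above can be gathered into one function. This is a minimal sketch that repeats the same calls in the same order; the name `preprocess` and the sample string are hypothetical.

```python
import re

def preprocess(text):
    """Apply the step-by-step cleaning in the order used in this notebook."""
    text = re.sub(r'\b[A-Z]+\b:', '', text)   # remove labels such as "BERNARDO:"
    text = re.sub(r'[!?;]', '.', text)        # '!', '?' and ';' -> '.'
    text = text.replace('\n', ' ')            # newline -> space
    text = text.lower()                       # lowercase
    text = re.sub(r'[^a-z .]', '', text)      # keep only letters, spaces and '.'
    text = re.sub(r' +', ' ', text)           # collapse runs of spaces
    return text

# Made-up sample, not read from the data file
sample = "BERNARDO: Who's there?\nFRANCISCO: Nay, answer me;   stand!"
result = preprocess(sample)
print(result)
```

The result can start with a space left over from a removed label; calling `.strip()` on it would drop that space if it is unwanted.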

---
---

#### Statistics

In [14]:
fs_text = re.sub(r'\b[A-Z]+\b:', '', orig_txt)
fs_text = re.sub(r' +', ' ', fs_text)
fs_text = fs_text.lower()
# print(fs_text)

In [15]:
CharacterCounts(fs_text)

{' ': 603,
 'e': 324,
 't': 252,
 'a': 227,
 'o': 202,
 'n': 171,
 's': 163,
 'i': 158,
 'h': 146,
 'r': 140,
 'l': 113,
 '\n': 111,
 'c': 104,
 'd': 79,
 'm': 62,
 'f': 62,
 'u': 60,
 '.': 59,
 ',': 50,
 'p': 41,
 'g': 41,
 'w': 39,
 'y': 36,
 ':': 27,
 'b': 26,
 'k': 19,
 '1': 16,
 'v': 14,
 '4': 13,
 '[': 11,
 '!': 11,
 ']': 10,
 '5': 10,
 'x': 9,
 '2': 9,
 '3': 9,
 '’': 9,
 ';': 8,
 '?': 7,
 '0': 5,
 '-': 5,
 '"': 4,
 'j': 3,
 "'": 2,
 'q': 2,
 '\ufeff': 1,
 '9': 1,
 '(': 1,
 ')': 1,
 '6': 1,
 '7': 1,
 '—': 1}

In [16]:
new_text = re.sub(r'\b[A-Z]+\b:', '', orig_txt)
CharacterCounts(new_text, multyspace = True, uppercase = True)

{' ': 787,
 'e': 311,
 't': 237,
 'o': 197,
 'a': 174,
 'n': 167,
 'i': 149,
 'r': 136,
 's': 134,
 'h': 130,
 '\n': 111,
 'l': 108,
 'c': 99,
 'd': 71,
 'u': 59,
 '.': 59,
 'm': 57,
 'f': 57,
 'A': 53,
 ',': 50,
 'g': 38,
 'p': 35,
 'y': 34,
 'S': 29,
 ':': 27,
 'w': 24,
 'b': 21,
 'k': 19,
 'H': 16,
 '1': 16,
 'T': 15,
 'W': 15,
 'v': 14,
 'E': 13,
 '4': 13,
 '[': 11,
 '!': 11,
 ']': 10,
 '5': 10,
 'x': 9,
 'I': 9,
 '2': 9,
 '3': 9,
 '’': 9,
 'D': 8,
 ';': 8,
 '?': 7,
 'P': 6,
 'C': 5,
 'M': 5,
 'O': 5,
 'F': 5,
 'B': 5,
 'L': 5,
 '0': 5,
 '-': 5,
 '"': 4,
 'R': 4,
 'N': 4,
 'G': 3,
 'J': 2,
 "'": 2,
 'q': 2,
 'Y': 2,
 '\ufeff': 1,
 '9': 1,
 '(': 1,
 ')': 1,
 '6': 1,
 '7': 1,
 'U': 1,
 'j': 1,
 '—': 1}
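
The single `'\ufeff'` in both tables is a byte order mark (BOM) read from the start of the file. Assuming the file is UTF-8 with a BOM, opening it with `encoding='utf-8-sig'` would strip the mark. The sketch below demonstrates this on a temporary file rather than on the notebook's data; `bom_demo.txt` is a hypothetical name.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'bom_demo.txt')

    # 'utf-8-sig' writes a BOM in front of the text
    with open(demo_path, 'w', encoding='utf-8-sig') as f:
        f.write('Hamlet')

    # Plain 'utf-8' keeps the BOM as the character '\ufeff'
    with open(demo_path, 'r', encoding='utf-8') as f:
        with_bom = f.read()

    # 'utf-8-sig' removes the BOM while reading
    with open(demo_path, 'r', encoding='utf-8-sig') as f:
        without_bom = f.read()

print(repr(with_bom))
print(repr(without_bom))
```

In the preprocessing itself the mark does no harm, because the later `[^a-z .]` filter removes it together with every other non-letter character.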

### For all "Hamlet"

In [17]:
file_name = 'Hamlet.txt'
folder = 'Data/'
path = os.path.join(drch, folder, file_name)

with open(path, 'r') as file:
    orig_txt = file.read()

First attempt at text processing (kept for reference):

In [18]:
pp_text = copy.copy(orig_txt)
pp_text = re.sub(r'\b[A-Z]+\b:', '', pp_text)   # remove speaker labels such as "BERNARDO:"
pp_text = re.sub(r'[!?;]', '.', pp_text)        # `!`, `?` & `;` -> `.`
pp_text = pp_text.replace('\n', ' ')            # '\n' -> ' '
pp_text = pp_text.lower()                       # convert to lowercase
pp_text = re.sub(r'[^a-z ]', ' ', pp_text)      # everything except letters and spaces -> ' '
pp_text = re.sub(r' +', ' ', pp_text)           # collapse runs of spaces
# pp_text = re.sub(r'[^a-z .]', '', pp_text)      # alternative: keep only letters, spaces and '.'

Space_n_letters(pp_text)

Symbols number: 211800
Spaces number: 41644
Only letters:  170156
Only letters (2):  170156
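
This attempt replaces unwanted characters with a space, while the step-by-step version above deletes them and keeps `.`. A small sketch on a made-up sample shows how the two choices differ; the variable names `as_spaces` and `dropped` are hypothetical.

```python
import re

sample = "shakespeare's works. act 1, scene 2: a room"

# Variant used in this cell: unwanted characters become spaces
as_spaces = re.sub(r' +', ' ', re.sub(r'[^a-z ]', ' ', sample))

# Variant from the step-by-step section: unwanted characters are deleted, '.' is kept
dropped = re.sub(r' +', ' ', re.sub(r'[^a-z .]', '', sample))

print(as_spaces)
print(dropped)
```

Replacing with a space splits `shakespeare's` into two tokens, while deleting fuses it into `shakespeares`. Since this cell leaves only letters and spaces, both "Only letters" counts printed above are equal.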


In [19]:
# For statistics:
fs_text = re.sub(r'\b[A-Z]+\b:', '', orig_txt) # remove speaker labels, only needed for "Hamlet" (see above)
fs_text = re.sub(r' +', ' ', fs_text)
fs_text = fs_text.lower()

CharacterCounts(fs_text)

{' ': 41949,
 'e': 20205,
 't': 15720,
 'o': 14297,
 'a': 12924,
 'i': 11417,
 'n': 11086,
 's': 11057,
 'h': 10512,
 'r': 10063,
 'l': 7583,
 'd': 6731,
 'u': 5529,
 'm': 5027,
 'y': 4093,
 '\n': 4034,
 'w': 4019,
 'c': 3768,
 ',': 3703,
 'f': 3693,
 'g': 3262,
 '.': 2804,
 'p': 2715,
 'b': 2463,
 'k': 1630,
 'v': 1620,
 '[': 1158,
 ']': 1039,
 '’': 971,
 ';': 839,
 ':': 624,
 '?': 457,
 '5': 429,
 '!': 401,
 '0': 399,
 '1': 355,
 'x': 253,
 '-': 208,
 '2': 206,
 'j': 192,
 'q': 184,
 '—': 146,
 '3': 141,
 "'": 130,
 'z': 113,
 '4': 111,
 '6': 72,
 '(': 71,
 ')': 71,
 '7': 68,
 '8': 62,
 '9': 58,
 '"': 45,
 '‘': 39,
 '&': 5,
 'æ': 2}

In [20]:
# Write the preprocessed text to a file
file_name = 'hamlet_1_pp.txt'
path = os.path.join(drch, file_name)

with open(path, 'w') as file:
    file.write(pp_text)

---
---
---
## Bli "BERNARDO:"

In [None]:
drch = '/content/drive/MyDrive/DS small/Letters pairs/'

path_r = os.path.join(drch, 'Data_add', 'Hamlet.txt')
with open(path_r, 'r') as file:
    orig_Hamlet = file.read()

Hamlet_BB = re.sub(r'\b[A-Z]+\b:', '', orig_Hamlet)   # Regular expression for "BERNARDO:" etc.

path_w = os.path.join(drch, 'Data_add', 'Hamlet_BB.txt')
with open(path_w, 'w') as file:
    file.write(Hamlet_BB)