# Preprocessing for **test data**

## About

This notebook preprocesses the texts that will be used as test data. For more detail about the preprocessing, see `Letters_pairs_preproc_2.ipynb`.




The following texts were used as test data:


- Computer_articles: [1](https://www.projectpro.io/article/gpus-for-machine-learning/677), [2](https://www.tomshardware.com/news/should-i-buy-a-3d-printer-this-black-friday), [3](https://towardsdatascience.com/zero-etl-chatgpt-and-the-future-of-data-engineering-71849642ad9c)
- Chat_comments: [1](https://twitter.com/DrEliDavid/status/1737933931297595678), [2](https://www.youtube.com/watch?v=FUHkTs-Ipfg&ab_channel=Veritasium), [3](https://www.foxnews.com/politics/calls-grow-for-biden-to-denounce-colorados-removal-of-trump-from-2024-ballot-smartest-move)
- Jane_Eyre (Chapters 1-3): [here](https://www.gutenberg.org/files/1260/1260-h/1260-h.htm)
- Robert_Burns (1780-1781): [here](https://www.gutenberg.org/files/1279/1279-h/1279-h.htm)
- Disturbed_Immortalized (2015): [here](https://www.lyricsondemand.com/d/disturbedlyrics/immortalizedalbumlyrics.html)
- Advanced_Literature (Boston University: Upper-level Undergraduate Courses in Language and Literature): [here](https://www.bu.edu/english/undergraduate/courses/fall-2022-undergraduate-courses/)
- News_articles (Arutz Sheva): [1](https://www.israelnationalnews.com/news/382411), [2](https://www.israelnationalnews.com/news/382412), [3](https://www.israelnationalnews.com/news/382401), [4](https://www.israelnationalnews.com/news/382382)





---
---
## Import & mount

In [3]:
import os
import re
import copy
from time import time
import numpy as np
import pandas as pd

# from matplotlib import pyplot as plt
# plt.rcParams['figure.figsize'] = [15, 6]

In [4]:
# Clone the GitHub repository that holds the data
!git clone https://github.com/EdwardGerman/Columnar-Transposition-Cipher.git  # clone repository
%ls  # checking whether all files are present
drch = '/content/Columnar-Transposition-Cipher' # Path to data

folder_r = 'Data_test'
folder_w = 'Data_test_pp'



---
---
## Functions

#### Space_n_Letters()
Counts all characters, letters and spaces in a preprocessed text.

In [5]:
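def Space_n_Letters(pp_text, p=False):
    """Count characters, letters and spaces in a preprocessed text.

    With p=False, return [chars, letters, spaces]; with p=True, print the counts instead.
    """
    chars_after_PP = len(pp_text)
    spaces = pp_text.count(' ')
    letters = chars_after_PP - spaces   # valid because pp_text holds only letters and spaces

    if not p:
        return [chars_after_PP, letters, spaces]
    print('Symbols number:', chars_after_PP)
    print('Spaces number:', spaces)
    print('Only letters: ', letters, '\n')
    # print('Only letters (2): ', len(list(filter(str.isalpha, pp_text))),'\n')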
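
A quick usage sketch for this counter. The snake_case name `space_n_letters` and the sample string are made up for illustration, and the function is re-declared only so that the snippet runs on its own.

```python
def space_n_letters(pp_text):
    """Return [total characters, letters, spaces] for a preprocessed text."""
    total = len(pp_text)
    spaces = pp_text.count(' ')
    # Everything that is not a space is a letter, because the input is
    # assumed to contain only lowercase letters and single spaces.
    return [total, total - spaces, spaces]


sample = 'the quick brown fox'   # made-up preprocessed text
counts = space_n_letters(sample)
print(counts)   # [19, 16, 3]
```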

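#### CharactersCount()
Counts the occurrences of each character in the text and returns them as a dictionary sorted by decreasing count.

Notes (default behavior, can be switched off):
* Uppercase letters are counted as lowercase (`uppercase = False`)
* Multiple consecutive spaces are counted as one space (`multispace = False`)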

In [6]:
# Returns sorted dictionary {letter : count} from largest to smallest number
def CharactersCount(orig_text, multispace = False, uppercase = False):
    text = copy.copy(orig_text)
    if multispace == False: text = re.sub(r' +', ' ', text)
    if uppercase  == False: text = text.lower()

    counts = {}   # Creating empty dictionary
    for char in text:    # Loop for count characters
        if char in counts: counts[char] += 1
        else:              counts[char]  = 1
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
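
The same counting can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the code used in this notebook; the name `characters_count` and the sample string are hypothetical.

```python
import re
from collections import Counter


def characters_count(text, multispace=False, uppercase=False):
    """Return {character: count}, ordered from most to least frequent."""
    if not multispace:
        text = re.sub(r' +', ' ', text)   # collapse runs of spaces
    if not uppercase:
        text = text.lower()               # count 'A' and 'a' together
    return dict(Counter(text).most_common())


sample = 'Aa  bB a'                       # made-up input
result = characters_count(sample)
print(result)   # {'a': 3, ' ': 2, 'b': 2}

# With both options switched on, case and repeated spaces are kept apart.
raw = characters_count(sample, multispace=True, uppercase=True)
print(raw[' '], raw['A'], raw['a'])   # 3 1 2
```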

---
---
---
## Processing
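The steps are applied in the following order:
- convert to lowercase;
- remove [apostrophes](https://en.wikipedia.org/wiki/Apostrophe) in the middle of words;
- keep only letters and spaces (everything else is converted to a space);
- collapse multiple spaces into a single space;
- remove a trailing space, if there is one.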





In [7]:
def TextPreprocessing(orig_txt):
    pp_text = copy.copy(orig_txt)

    pp_text = pp_text.lower()                                   # to lower letters
    pp_text = re.sub(r"(?<=[a-z])[`'’](?=[a-z])", '', pp_text)  # remove `'’

    pp_text = re.sub(r'[^a-z ]', ' ', pp_text)  # only letters and spaces (all rest convert to space)
    pp_text = re.sub(r' +', ' ', pp_text)       # multispace to space

    if pp_text.endswith(' '): pp_text = pp_text[:-1]

    return pp_text
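
A worked example of these steps on a made-up sentence. The snake_case name `text_preprocessing` is hypothetical; the body mirrors the function above so that the snippet runs on its own.

```python
import re


def text_preprocessing(text):
    text = text.lower()                                    # to lowercase
    text = re.sub(r"(?<=[a-z])[`'’](?=[a-z])", '', text)   # drop in-word apostrophes
    text = re.sub(r'[^a-z ]', ' ', text)                   # everything else becomes a space
    text = re.sub(r' +', ' ', text)                        # collapse runs of spaces
    return text[:-1] if text.endswith(' ') else text       # drop one trailing space


sample = "It's 2024: don't PANIC!  Re-read the file."      # made-up input
cleaned = text_preprocessing(sample)
print(cleaned)   # its dont panic re read the file

# Only a trailing space is stripped, so a text that starts with a
# non-letter keeps one leading space.
quoted = text_preprocessing('"Quoted"')
print(repr(quoted))   # ' quoted'
```

Note that "it's" becomes "its" rather than "it s", because the apostrophe is removed before the remaining punctuation is turned into spaces.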

---

Get list of all files in directory

In [8]:
# Get list of all files in directory
file_list = os.listdir(os.path.join(drch, folder_r))
file_list

['Computer_articles.txt',
 'Chat_comments.txt',
 'Jane_Eyre.txt',
 'Disturbed_Immortalized.txt',
 'Advanced_Literature.txt',
 'Robert_Burns.txt',
 'about.txt',
 'News_articles.txt']

In [9]:
file_list = ['Computer_articles.txt',
            'Chat_comments.txt',
            'Jane_Eyre.txt',
            'Disturbed_Immortalized.txt',
            'Advanced_Literature.txt',
            'Robert_Burns.txt',
            'News_articles.txt']
file_list

['Computer_articles.txt',
 'Chat_comments.txt',
 'Jane_Eyre.txt',
 'Disturbed_Immortalized.txt',
 'Advanced_Literature.txt',
 'Robert_Burns.txt',
 'News_articles.txt']

Creating DataFrames for statistics

In [10]:
chars_count_df = pd.DataFrame()                     # DF for counts of all chars in the original texts
chars_count_pp_df = pd.DataFrame()                  # DF for counts of chars in the texts after preprocessing
                                                    # (only letters and spaces)
chars_number_df = pd.DataFrame(index=file_list,     # DF that contains the number of chars before & after PP
                               columns=['Chars before PP', 'Chars after PP', 'Letters', 'Spaces'])

Preprocessing in the loop:
- read the text from the file;
- count the characters in the original text;
- run the preprocessing itself;
- count the characters in the preprocessed text;
- write the preprocessed text to a file.



In [11]:
all_pp_text = ''

# Loop for each file:
for text_file in file_list:
    file_name = text_file.replace('.txt', '')   # Only file name, without extension

    # Read:
    path_r = os.path.join(drch, folder_r, text_file)
    with open(path_r, 'r', errors='ignore') as file:
        orig_text = file.read()

    # Count of characters in original text (before pp):
    ser = pd.DataFrame([CharactersCount(orig_text)]).T
    ser.columns = [file_name]
    chars_count_df = pd.concat([chars_count_df, ser], axis=1)

    # Preprocessing process:
    pp_text = TextPreprocessing(orig_text)

    # Count of characters in preproc text:
    ser = pd.DataFrame([CharactersCount(pp_text)]).T
    ser.columns = [file_name]
    chars_count_pp_df = pd.concat([chars_count_pp_df, ser], axis=1)

    # Write:
    path_w = os.path.join(drch, folder_w, file_name + '_pp.txt')
    with open(path_w, 'w') as file:
        file.write(pp_text)

    # Count:
    chars_before_PP = len(orig_text)
    chars_number_df.loc[text_file] = [chars_before_PP] + Space_n_Letters(pp_text)

    # Add preproc text to "All text" variable
    all_pp_text = all_pp_text + ' ' + pp_text

Finalizing the statistics DataFrames

In [12]:
# --------------------------------------------------------------------------------------------------
# Finishing for "chars_count_df"
chars_count_df = chars_count_df.fillna(0)           # Replace NaN with 0
chars_count_df = chars_count_df.astype(int)         # All values are integers

chars_count_df['Sum'] = chars_count_df.sum(axis=1)  # New column with sums for each chars in all files
chars_count_df = chars_count_df.sort_values(by='Sum', ascending=False)      # sort by count

# --------------------------------------------------------------------------------------------------
# Finishing for "chars_count_pp_df"
chars_count_pp_df = chars_count_pp_df.fillna(0)     # Replace NaN with 0
chars_count_pp_df = chars_count_pp_df.astype(int)   # All values are integers

chars_count_pp_df['Sum'] = chars_count_pp_df.sum(axis=1)  # New column with sums for each char in all files
chars_count_pp_df = chars_count_pp_df.sort_values(by='Sum', ascending=False)  # sort by count

# --------------------------------------------------------------------------------------------------
# Finishing for "chars_number_df"
chars_number_df.loc['All texts'] = chars_number_df.sum()

---

Write all preprocessed texts to a single text file

In [13]:
all_pp_text = all_pp_text[1:]
path_w = os.path.join(drch, folder_w, 'All_texts_test' + '_pp.txt')
with open(path_w, 'w') as file:
    file.write(all_pp_text)

## Statistics

#### 1. Number of characters
- Number of all characters before preprocessing
- Number of all characters after preprocessing
- Number of letters after preprocessing
- Number of spaces after preprocessing

In [14]:
chars_number_df

| File | Chars before PP | Chars after PP | Letters | Spaces |
|---|---|---|---|---|
| Computer_articles.txt | 46613 | 44691 | 37100 | 7591 |
| Chat_comments.txt | 82490 | 78370 | 63976 | 14394 |
| Jane_Eyre.txt | 49246 | 47130 | 38220 | 8910 |
| Disturbed_Immortalized.txt | 26480 | 25685 | 20628 | 5057 |
| Advanced_Literature.txt | 23126 | 21363 | 17929 | 3434 |
| Robert_Burns.txt | 13883 | 11468 | 9196 | 2272 |
| News_articles.txt | 5702 | 5390 | 4509 | 881 |
| All texts | 247540 | 234097 | 191558 | 42539 |
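
The figures in this table can be cross-checked with plain Python. The sketch below copies the per-file values from the output above; the names `rows` and `totals` are illustrative only.

```python
# (chars before PP, chars after PP, letters, spaces), copied from the table
rows = {
    'Computer_articles.txt':      (46613, 44691, 37100, 7591),
    'Chat_comments.txt':          (82490, 78370, 63976, 14394),
    'Jane_Eyre.txt':              (49246, 47130, 38220, 8910),
    'Disturbed_Immortalized.txt': (26480, 25685, 20628, 5057),
    'Advanced_Literature.txt':    (23126, 21363, 17929, 3434),
    'Robert_Burns.txt':           (13883, 11468, 9196, 2272),
    'News_articles.txt':          (5702, 5390, 4509, 881),
}

for name, (before, after, letters, spaces) in rows.items():
    # After preprocessing a text holds only letters and spaces ...
    assert after == letters + spaces, name
    # ... and it cannot be longer than the original text.
    assert after <= before, name

# Column-wise totals should reproduce the 'All texts' row.
totals = tuple(sum(column) for column in zip(*rows.values()))
print(totals)   # (247540, 234097, 191558, 42539)
```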


#### 2. Number of each character in each file `before preproc`

In [15]:
pd.set_option('display.max_rows', None) # Print all rows (features) in DF
display(chars_count_df)
pd.reset_option('display.max_rows')     # Default setting: print only the first & last 5 rows

| Char | Computer_articles | Chat_comments | Jane_Eyre | Disturbed_Immortalized | Advanced_Literature | Robert_Burns | News_articles | Sum |
|---|---|---|---|---|---|---|---|---|
| `' '` | 7352 | 13949 | 8570 | 4218 | 3390 | 2246 | 851 | 40576 |
| e | 4372 | 7627 | 4853 | 2817 | 2102 | 1201 | 549 | 23521 |
| t | 3186 | 6625 | 3207 | 1740 | 1523 | 823 | 361 | 17465 |
| a | 3244 | 4754 | 3091 | 1333 | 1420 | 799 | 437 | 15078 |
| o | 2582 | 5158 | 2824 | 1881 | 1342 | 591 | 272 | 14650 |
| i | 2762 | 5075 | 2749 | 1462 | 1599 | 584 | 387 | 14618 |
| n | 2587 | 4351 | 2612 | 1588 | 1425 | 656 | 302 | 13521 |
| s | 2420 | 3949 | 2472 | 985 | 1232 | 680 | 291 | 12029 |
| r | 2480 | 3724 | 2339 | 1144 | 1208 | 580 | 356 | 11831 |
| h | 1342 | 3291 | 2193 | 1042 | 641 | 633 | 232 | 9374 |


#### 3. Number of each character in each file `after preproc`

In [16]:
# 'Sum' was already computed and the rows sorted in cell [12]. Recomputing it here
# would add the existing 'Sum' column to the row totals and double every value.
chars_count_pp_df

| Char | Computer_articles | Chat_comments | Jane_Eyre | Disturbed_Immortalized | Advanced_Literature | Robert_Burns | News_articles | Sum |
|---|---|---|---|---|---|---|---|---|
| `' '` | 7591 | 14394 | 8910 | 5057 | 3434 | 2272 | 881 | 42539 |
| e | 4372 | 7627 | 4853 | 2817 | 2102 | 1201 | 549 | 23521 |
| t | 3186 | 6625 | 3207 | 1740 | 1523 | 823 | 361 | 17465 |
| a | 3244 | 4754 | 3091 | 1333 | 1420 | 799 | 437 | 15078 |
| o | 2582 | 5158 | 2824 | 1881 | 1342 | 591 | 272 | 14650 |
| i | 2762 | 5075 | 2749 | 1462 | 1599 | 584 | 387 | 14618 |
| n | 2587 | 4351 | 2612 | 1588 | 1425 | 656 | 302 | 13521 |
| s | 2420 | 3949 | 2472 | 985 | 1232 | 680 | 291 | 12029 |
| r | 2480 | 3724 | 2339 | 1144 | 1208 | 580 | 356 | 11831 |
| h | 1342 | 3291 | 2193 | 1042 | 641 | 633 | 232 | 9374 |


In [17]:
lst = sorted(chars_count_pp_df.index.tolist())
print(lst)
# print(chars_count_pp_df.index.tolist().sort())

[' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
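
The printed index is exactly the space plus the 26 lowercase Latin letters, which is what the preprocessing is meant to guarantee. A small check along these lines could flag any character that slips through; the helper `unexpected_chars` and the sample strings are hypothetical.

```python
import string

ALLOWED = set(string.ascii_lowercase + ' ')   # 26 letters plus the space


def unexpected_chars(text):
    """Return the sorted characters of `text` that fall outside the allowed alphabet."""
    return sorted(set(text) - ALLOWED)


clean = unexpected_chars('its dont panic')
dirty = unexpected_chars('café 42')
print(clean)   # []
print(dirty)   # ['2', '4', 'é']
```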


---

---
---
## Hamlet

### Without "BERNARDO:"

In [18]:
# drch = '/content/drive/MyDrive/DS small/Letters pairs/'

# path_r = os.path.join(drch, 'Data_add', 'Hamlet.txt')
# with open(path_r, 'r') as file:
#     orig_Hamlet = file.read()

# Hamlet_BB = re.sub(r'\b[A-Z]+\b:', '', orig_Hamlet)   # Regular expression for "BERNARDO:" etc.

# path_w = os.path.join(drch, 'Data_add', 'Hamlet_BB.txt')
# with open(path_w, 'w') as file:
#     file.write(Hamlet_BB)
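
A small sketch of what the speaker-label pattern from the cell above does, run on made-up lines rather than on the actual `Hamlet.txt` file. The names `SPEAKER_LABEL`, `sample`, `stripped` and `two_words` are illustrative.

```python
import re

SPEAKER_LABEL = r'\b[A-Z]+\b:'   # same pattern as in the commented-out cell

sample = "BERNARDO: Who's there?\nFRANCISCO: Nay, answer me."
stripped = re.sub(SPEAKER_LABEL, '', sample)
print(repr(stripped))   # " Who's there?\n Nay, answer me."

# The pattern matches a single uppercase word followed by a colon, so in a
# two-word label only the last word is removed.
two_words = re.sub(SPEAKER_LABEL, '', 'LORD POLONIUS: Yes')
print(repr(two_words))  # 'LORD  Yes'
```

The space after each removed label stays in the text; the later "multispace to space" step of the preprocessing collapses any resulting double spaces.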


---
---
---
#!!!
Draft version (this code is not used in this project)

In [19]:
# text   = "When I see the title, I was scared that it would required a NSFW tag, lol! It's safe to assume that's why we didn't get to see the basement of the Inn Seeing all those rabbits tells me this is moon guard."
# text_1 = "When I see the tit3le, I was sca4red that it w5ould required a 1NSFW tag, lo6l! It's safe to as9sume that'0s why w2e didn't get to see the bas34ement of the Inn See5ing all6 7those r0a0b8b7its tel6ls me th7is is moon gua8rd9."
# text_2 = "When I see the title, I was scared that it would required a NSFW tag, lol! It's safe to assume that's why we didn't get to see the basement of the Inn Seeing all those rabbits tells me this is moon guard."


In [20]:
# text = "This is a line 123 with several 4 digits in words: b2a, a3bc, 1def, ghi4, eff56r, fgf34."
# # Use a regular expression to remove digits that are surrounded by letters
# modified_text = re.sub(r'(?<=[a-zA-Z])\d(?=[a-zA-Z])', '', text)
# print(modified_text)

In [21]:
# text = "This is a line: b`a, ab~c, d'e!f, ghi~, ef'fr, fgf'."

# modified_text = re.sub(r"(?<=[a-zA-Z])[`~!'’](?=[a-zA-Z])", '', text)
# print(modified_text)

In [22]:
# file_name = 'Torah_Bereshit_Bereshit.txt'
# folder = 'Data_add/'
# path = os.path.join(drch, folder, file_name)

# with open(path, 'r') as file:
#     orig_text = file.read()

# pp_text = copy.copy(orig_text)
# print(orig_text.count("’"))
# print(orig_text.count("'"))
# print(orig_text.count("‘"),'\n')


# pp_text = pp_text.lower()                                   # to lower letters
# pp_text = re.sub(r"(?<=[a-z])[`'’](?=[a-z])", '', pp_text)  # remove `'’
# print(pp_text)

# print(pp_text.count("’"))
# print(pp_text.count("'"))
# print(pp_text.count("‘"),'\n')

# file_name = 'Torah_Bereshit_Bereshit_pp.txt'
# folder = 'Data_add/'
# path = os.path.join(drch, folder, file_name)

# with open(path, 'w') as file:
#     file.write(pp_text)