<center>    
    <h1 id='spacy-notebook-0' style='color:#7159c1; font-size:350%'>Text Manipulation</h1>
    <i style='font-size:125%'>Basics of Text Manipulation</i>
</center>

> **Topics**

```
- 📜 Types of Texts
- 📦 Text Encodings
- 🔆 Optical Character Recognition (OCR)
- 🏷️ Processing Multiple Files
```

<h1 id='0-types-of-texts' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📜 | Types of Texts</h1>

Computers can store texts in three different formats: `Rich Text`, `Structured Text` and `Plain Text`:

- **Rich Text** - `texts that may contain formatting for better visualization, such as font sizes, bold, italic, images, tables, plots and much more. Unlike Structured Texts, the terms are not wrapped by signs that tell how the text must be formatted; instead, the formatting rules are inserted into the binary data of the document. Examples: Microsoft Word and LibreOffice Writer files`;
  
- **Structured Text** - `texts that may contain formatting for better visualization, such as font sizes, bold, italic, images, tables, plots and much more. However, unlike Rich Text, the terms are wrapped by signs that tell how the text must be formatted, so we can say the rules are inserted into the text itself. Examples: HTML, XML, LaTeX and Markdown files`;
  
- **Plain Text** - `unlike Rich and Structured Texts, Plain Texts don't contain any kind of formatting, that is, the text is stored as-is, without any special visual appearance, and consists only of letters, numbers, symbols, punctuation, spaces and line breaks. This is the best type of text to work with in NLP. Examples: TXT, CSV and CoNLL-U files`.

Plain Texts are the best ones to work with, not only for NLP in Python but in any programming language. However, we normally stumble upon Rich and Structured Texts when scraping samples for NLP projects, so we must be able to convert them into Plain Texts, as well as deal with the traces left by the conversion.

This happens especially when our dataset has been extracted via `Optical Character Recognition (OCR)`.
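
To make the conversion from Structured Text to Plain Text more concrete, here is a minimal sketch that strips the tags from a small HTML snippet using only Python's standard library. The `PlainTextExtractor` class, the `html_to_plain_text` function and the sample string are hypothetical names created for this illustration; real-world pages usually need a more careful treatment (scripts, styles, entities and so on).

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Hypothetical helper that keeps only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every piece of text between tags
        self.chunks.append(data)


def html_to_plain_text(html):
    extractor = PlainTextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse the whitespace left behind by the removed tags
    return ' '.join(''.join(extractor.chunks).split())


structured_text = '<p>NLP works <b>best</b> with <i>plain</i> text.</p>'
plain_text = html_to_plain_text(structured_text)
print(f'- Plain Text: {plain_text}')
```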

<h1 id='1-text-encodings' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📦 | Text Encodings</h1>

Besides, all texts must be encoded in order to be read by computers, that is, each character of the text must have a numeric representation that the computer understands. This number can be written in decimal, binary or hexadecimal, the last one being the most common.
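
As a quick illustration of this idea, the sketch below prints the number behind a single character in the three notations. It relies only on Python's built-in `ord` and `chr` functions; the variable names are just for this example.

```python
character = 'A'
code = ord(character)  # numeric representation of the character

print(f'- Decimal: {code}')
print(f'- Binary: {code:b}')
print(f'- Hexadecimal: {code:X}')

# 'chr' goes the other way around, from the number back to the character
print(f'- Character: {chr(code)}')
```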

Currently, the two main encoding types are `ASCII` and `Unicode`, where:

- **ASCII (American Standard Code for Information Interchange)** - `it's the first widely successful text encoding standard, representing each character with 7 bits (usually stored in one byte), so the numeric values go from 0 to 127. Even though ASCII fits English very well, it cannot be used for all texts since it does not encode accented characters and some special characters found in other languages, for instance, Portuguese ('ú', 'õ', 'ê', 'ç' and 'à')`;

- **Unicode** - `it was created to solve the problem of accented and special characters encountered when working with ASCII. It's widely applied to encode all kinds of texts nowadays, with room for more than 1,100,000 code points covering letters, symbols and even emojis!! An important thing about Unicode is that it can be implemented by different encoding forms ('UTF-8', 'UTF-16' and 'UTF-32'), 'UTF-8' being the most general and widely used one. Just for curiosity, Unicode code points go from U+0000 to U+10FFFF, where the digits after 'U+' are the hexadecimal value of the code point`.
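
The sketch below shows what these code points look like in practice and how many bytes UTF-8 needs for each character. The sample characters are arbitrary choices for this illustration.

```python
code_points = {}
utf8_lengths = {}

for character in ('a', 'ç', '€', '😀'):
    # Code points are conventionally written as 'U+' followed by hexadecimal digits
    code_points[character] = f'U+{ord(character):04X}'
    utf8_lengths[character] = len(character.encode('utf-8'))
    print(f'- {character!r}: {code_points[character]}, {utf8_lengths[character]} byte(s) in UTF-8')
```

Characters in the ASCII range take a single byte in UTF-8, while characters further up the Unicode range take two, three or four bytes.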

Thus, when dealing with a variety of texts and languages, it's recommended to work with `Plain Texts` encoded in `Unicode` format, more specifically in `UTF-8` rather than `UTF-16` and `UTF-32`.
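
One practical difference between the three encoding forms is storage size. The sketch below compares them for a short English string; the `-le` (little-endian) codec variants are used only so that Python does not prepend a byte order mark, which would otherwise inflate the counts. Note that the size advantage shown here is an observation for ASCII-heavy text, not a reason stated by the recommendation itself.

```python
text = 'Text Manipulation'
sizes = {}

for encoding in ('utf-8', 'utf-16-le', 'utf-32-le'):
    sizes[encoding] = len(text.encode(encoding))
    print(f'- {encoding}: {sizes[encoding]} bytes')
```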

Besides, it's important to keep in mind that ASCII is a subset of UTF-8, so any ASCII text is already valid UTF-8, but a UTF-8 text can only be converted to ASCII when it contains no characters outside the ASCII range.
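
A minimal sketch of this one-way relationship, using hypothetical sample strings: the ASCII bytes decode as UTF-8 without any change, while encoding an accented character as ASCII raises an error.

```python
# ASCII -> UTF-8: the bytes are exactly the same
ascii_bytes = 'plain text'.encode('ascii')
decoded_as_utf8 = ascii_bytes.decode('utf-8')
print(f'- ASCII bytes read as UTF-8: {decoded_as_utf8}')

# UTF-8 -> ASCII: fails as soon as a non-ASCII character shows up
ascii_failed = False
try:
    'ação'.encode('ascii')
except UnicodeEncodeError as error:
    ascii_failed = True
    print(f'- Cannot encode as ASCII: {error.reason}')
```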

<h1 id='2-optical-character-recognition-ocr' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🔆 | Optical Character Recognition (OCR)</h1>

Without getting into details, `Optical Character Recognition (OCR)` is a technique for extracting text from documents by scanning them. It's often used in NLP when we need to digitize documents in order to process their contents.

Sometimes, we stumble upon issues when dealing with OCR, such as wrongly recognized characters, repeated characters and recognized characters that don't exist in the original document. Texts with these problems are called `dirty`, and it's our job to clean the digitized texts!!

Let's do an exercise cleaning a text file generated via OCR.

In [33]:
# Checking Encoding
import chardet
from pathlib import Path

DATASETS_PATH = Path('datasets')
guessed_encoding, detection_result = None, None

# Modes:
#   - 'with' blocks automatically close the file when they end;
#   - 'mode' parameter is optional and can receive the following values:
#       \ 't': text mode (default);
#       \ 'b': binary mode;
#       \ 'r': reading;
#       \ 'w': overwriting;
#       \ 'a': appending;
#       \ 'x': open for exclusive creation, throwing an error if the file already exists;
#       \ '+': open for updating (reading and writing).
#
with open(f'{DATASETS_PATH}/NYT_1991-01-16-A15.txt', mode='rb') as file:
    detection_result = chardet.detect(file.read(100_000))
    guessed_encoding = detection_result['encoding']
    print(detection_result)

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
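
The `UTF-8-SIG` result means the file is UTF-8 with a byte order mark (BOM) at its beginning. The sketch below, built on a hypothetical byte string rather than on the real file, shows why passing this encoding when reading matters: plain `utf-8` keeps the BOM as an invisible `'\ufeff'` character, while `utf-8-sig` drops it.

```python
import codecs

# Hypothetical file content: the UTF-8 BOM followed by the text itself
raw_bytes = codecs.BOM_UTF8 + 'U.S. TAKING STEPS'.encode('utf-8')

as_utf8 = raw_bytes.decode('utf-8')          # BOM survives as '\ufeff'
as_utf8_sig = raw_bytes.decode('utf-8-sig')  # BOM is removed

print(f'- Decoded as UTF-8: {as_utf8!r}')
print(f'- Decoded as UTF-8-SIG: {as_utf8_sig!r}')
```

The `utf-8-sig` codec also decodes UTF-8 data that has no BOM, so it is a safe choice when some files carry the mark and others don't.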


In [34]:
# Reading the File
text_content = None

with open(f'{DATASETS_PATH}/NYT_1991-01-16-A15.txt', mode='r', encoding=guessed_encoding) as file:
    text_content = file.read()

print(f'- File Object: {file}')
print(f'- File Content: {text_content}')

- File Object: <_io.TextIOWrapper name='datasets/NYT_1991-01-16-A15.txt' mode='r' encoding='UTF-8-SIG'>
- File Content: U.S. TAKING STEPS TO CURB TERRORISM: F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired
By JAMES BARRON
New York Times (1923-Current file); Jan 16, 1991;
ProQuest Historical Newspapers: The New York Times with Index pg. A15
U.S. TAKING STEPS TO CURB TERRORISM
F.B.I. Is Ordered to Find Iraqis Whose Visas Have Expired
By JAMES BARRON
   The Federal Bureau of Investigation has been ordered to track down as many as 3,000 Iraqis in this country whose visas have expired, the Justice Department said yesterday.
   The announcement came as security precautions were tightened throughout the United States. From financial exchanges in lower Manhattan to cloakrooms in Washington and homeless shelters in California, unfamiliar rituals were the order of the day. In many cities, identification badges were being given close scrutiny in office buildings that used to be open to a

In [35]:
# Removing all Line Breaks and Runs of Three Spaces
import re

# raw string that matches runs of line breaks and three-space indentations; each run is
# replaced by a single space so the words of adjacent lines don't get glued together
pattern = re.compile(r'(?:\n| {3})+')
cleared_text_content = pattern.sub(repl=' ', string=text_content)
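
The sketch below runs this kind of cleanup on a small excerpt to show how the choice of replacement string affects the result. The `sample` string and the variable names are hypothetical and exist only for this illustration.

```python
import re

sample = 'Visas Have Expired\nBy JAMES BARRON\n   The Federal Bureau'

# Replacing each match with an empty string glues neighboring words together
glued = re.sub(r'(\n|   )', '', sample)

# Replacing each run of matches with a single space keeps the words apart
spaced = re.sub(r'(?:\n| {3})+', ' ', sample)

print(f'- Empty replacement: {glued}')
print(f'- Space replacement: {spaced}')
```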

In [36]:
# Removing Dirty Characters:
#
#   - if we look closer at the text, we will realize that the document uses different characters
# for quotations:`“, ´´, ”, ’’, "`. It has probably happened because the file was extracted
# via `Optical Character Recognition (OCR)`, which wasn't able to accurately recognize the double
# quotes (") in all their occurrences;
#
#   - in order to keep a pattern for quotations, let's replace all these characters by a double
# quotes (").
#
pattern_2 = re.compile(r'(“|´´|”|’’)')
cleared_text_content = pattern_2.sub(repl='"', string=cleared_text_content)
print(f'- Cleared Text: {cleared_text_content}')
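
To see the effect of this replacement in isolation, the sketch below applies the same pattern to a short hypothetical sentence. Note that a single `’` used as an apostrophe is left untouched, since the pattern only matches it when doubled.

```python
import re

quote_pattern = re.compile(r'(“|´´|”|’’)')
sample = '“It’s tight,’’ he said. ´´Very tight.”'

normalized = quote_pattern.sub('"', sample)
print(f'- Normalized: {normalized}')
```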



<h1 id='3-processing-multiple-files' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🏷️ | Processing Multiple Files</h1>

Sometimes, we will be working not with a single file, but with multiple ones and, believe me, opening each one with `with-blocks` and reading its content is boring, even if we create a `for-loop` for the task.

Besides, Windows, Linux and macOS have differences in their file system paths, using different characters and structures to access files on the machine.

So, in order to read multiple files and deal with different file system path structures, we can stick to the `pathlib` package (we already used it in the previous code), killing two birds with one stone!!

Oh, and with `pathlib`, we can even apply filters to only consider specific files. For instance, picture a folder containing text (txt) and comma-separated values (csv) files from which we only want to read the text ones; to do it, we can simply use the pattern `*.txt`, which matches all files that have `.txt` as their extension.
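
The sketch below illustrates the path differences using `PurePosixPath` and `PureWindowsPath`, the `pathlib` classes that only manipulate path strings and therefore behave the same on any operating system. In everyday code, plain `Path` picks the right flavor for the machine automatically.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The '/' operator joins path parts regardless of the target system
posix_path = PurePosixPath('datasets') / 'NYT_1991-01-16-A15.txt'
windows_path = PureWindowsPath('datasets') / 'NYT_1991-01-16-A15.txt'

print(f'- Linux/macOS style: {posix_path}')
print(f'- Windows style: {windows_path}')
print(f'- Same file name: {posix_path.name == windows_path.name}')
```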
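
Here is a self-contained sketch of the filtering idea. It creates a temporary folder with hypothetical file names, so it can run anywhere without depending on the `datasets` folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as temporary_folder:
    folder = Path(temporary_folder)

    # Hypothetical files: two text files and one CSV file
    for filename in ('article.txt', 'notes.txt', 'table.csv'):
        (folder / filename).write_text('sample', encoding='utf-8')

    # Only the files ending with '.txt' match the pattern
    text_filenames = sorted(path.name for path in folder.glob('*.txt'))

print(f'- Fetched TXT Files: {text_filenames}')
```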

In [41]:
# Exploring Datasets Path
DATASETS_PATH = Path('datasets')
DATASETS_PATH

WindowsPath('datasets')

In [42]:
# Exploring Datasets Path
print(f'- Path: {DATASETS_PATH}')
print(f'- Does the path exist? {DATASETS_PATH.exists()}')
print(f'- Is the path a directory/folder? {DATASETS_PATH.is_dir()}')
print(f'- Is the path a file? {DATASETS_PATH.is_file()}')

- Path: datasets
- Does the path exist? True
- Is the path a directory/folder? True
- Is the path a file? False


In [45]:
# Reading only TXT files
text_files = list(DATASETS_PATH.glob(pattern='*.txt'))
print(f'- Fetched TXT Files Path: {text_files}')

- Fetched TXT Files Path: [WindowsPath('datasets/NYT_1991-01-16-A15.txt'), WindowsPath('datasets/occupy.txt'), WindowsPath('datasets/WP_1990-08-10-25A.txt'), WindowsPath('datasets/WP_1991-01-17-A1B.txt')]


In [47]:
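# Reading and Removing Multiple Occurrences of Dots and Commas
pattern = re.compile(r'[.,]{2,}') # runs of two or more dots and/or commas

# The output folder must exist before we write files into it
modified_folder = DATASETS_PATH / 'modified'
modified_folder.mkdir(parents=True, exist_ok=True)

for text_file in text_files:
    # 'UTF-8-SIG' also reads plain UTF-8 files and drops the BOM when it is present
    text_content = text_file.read_text(encoding='UTF-8-SIG')

    modified_text_content = pattern.sub(repl='', string=text_content)
    modified_filename = f'modified-{text_file.name}'
    modified_file_path = modified_folder / modified_filename
    modified_file_path.write_text(modified_text_content, encoding='UTF-8')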

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).