Commit
Integrated a preprocessing function that aims to remove non-characters from a given string;

Bugfixing;
Refactoring;
Added new test;
Added more examples to Demo.ipynb;
Updated README.md;
Halvani committed Jun 12, 2024
1 parent 177a6e9 commit d8eef54
Showing 5 changed files with 334 additions and 52 deletions.
201 changes: 196 additions & 5 deletions Demo.ipynb
@@ -607,6 +607,197 @@
"ws.is_featural(\"랑이와 곶\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering (only for script type: *alphabet*) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### List only lower/upper case characters"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a b c d e f g h i j k l m n o p q r s t u v w x y z ß ä ö ü\n",
"A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Ä Ö Ü\n"
]
}
],
"source": [
"german_alphabet_lower = ws.by_language(ws.Language.German, letter_case=ws.LetterCase.Lower)\n",
"ws.pretty_print(german_alphabet_lower)\n",
"\n",
"# Note that the upper-case list has no counterpart for \"ß\": the letter never occurs at the beginning of a word, and the capital \"ẞ\" (official since 2017) is only used in all-caps text.\n",
"german_alphabet_upper = ws.by_language(ws.Language.German, letter_case=ws.LetterCase.Upper)\n",
"ws.pretty_print(german_alphabet_upper)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Filter out digraphs (multigraphs)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A B C Ç D E Ë F G H I J K L M N O P Q R S T U V X Y Z a b c ç d e ë f g h i j k l m n o p q r s t u v x y z\n"
]
}
],
"source": [
"albanian_alphabet_no_multigraphs = ws.by_language(ws.Language.Albanian, \n",
" strip_multigraphs=True, \n",
" multigraphs_size=ws.MultigraphSize.Digraph)\n",
"\n",
"ws.pretty_print(albanian_alphabet_no_multigraphs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Filter out diacritics (acute, grave, circumflex, cedilla, etc.)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z Æ æ Œ œ\n"
]
}
],
"source": [
"french_alphabet_no_diacritics = ws.by_language(ws.Language.French, strip_diacritics=True)\n",
"ws.pretty_print(french_alphabet_no_diacritics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove non-characters (any character that does not belong to a script type supported by Alphabetic)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Czyż to nie miłe\n"
]
}
],
"source": [
"print(ws.keep_only_script_characters(\"§§..Czyż t+o nie miłe?\"))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"here it is\n"
]
}
],
"source": [
"print(ws.keep_only_script_characters(\"546here!\"\"§$ //(it)\\\\746 is*#*~~!!\"))"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dirty string\n"
]
}
],
"source": [
"print(ws.keep_only_script_characters(\"||><_d_i_r-t-y %$string-+++\"))"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"niini'kokoh'u3ecoo3i\n"
]
}
],
"source": [
"# Note that the “3” remains, as it is a valid letter in the Arapaho language\n",
"# https://www.omniglot.com/writing/arapaho.htm \n",
"print(ws.keep_only_script_characters(\"12456niini'kokoh'u3ecoo3i789!!\"))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AllSpacesRemovedHere\n"
]
}
],
"source": [
"print(ws.keep_only_script_characters(\"*A=l?l*Sp~aces Rem`&``oved?} Here \", keep_spaces=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -700,7 +891,7 @@
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 40,
"metadata": {},
"outputs": [
{
@@ -724,7 +915,7 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 41,
"metadata": {},
"outputs": [
{
@@ -766,7 +957,7 @@
},
{
"cell_type": "code",
"execution_count": 34,
"execution_count": 42,
"metadata": {},
"outputs": [
{
@@ -822,7 +1013,7 @@
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
@@ -831,7 +1022,7 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": 44,
"metadata": {},
"outputs": [
{
11 changes: 11 additions & 0 deletions README.md
@@ -117,6 +117,9 @@ ws.pretty_print(ws.by_language(ws.Language.Chinese_Simplified))
```
Another important use case is to check whether a given sequence of characters represents a specific script of a writing system. This can be achieved as follows:
```python
ws.is_abjad("גדולים או בינוניים") # True
ws.is_alphabet("גדולים או בינוניים") # False

ws.is_alphabet("dobré ráno") # True
ws.is_abjad("dobré ráno") # False

@@ -133,6 +136,14 @@ ws.is_alphabet("დილა მშვიდობისა") # True
ws.is_abjad("დილა მშვიდობისა") # False
```
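
To give a feel for what such a check involves, here is a minimal standard-library sketch. It is not Alphabetic's implementation: the lookup table and the names `SCRIPT_TYPE_BY_UNICODE_PREFIX`, `guess_script_types` and `looks_like` are hypothetical, and the approach (reading the script from the first word of each letter's Unicode name) is an assumption made purely for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny lookup table (illustration only).
# Keys are the first word of a character's Unicode name.
SCRIPT_TYPE_BY_UNICODE_PREFIX = {
    "LATIN": "alphabet",
    "CYRILLIC": "alphabet",
    "GREEK": "alphabet",
    "GEORGIAN": "alphabet",
    "HEBREW": "abjad",
    "ARABIC": "abjad",
}


def guess_script_types(text):
    """Return the set of script types suggested by the letters in `text`."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore spaces, digits and punctuation
        name = unicodedata.name(ch, "")  # "" if the character has no name
        prefix = name.split(" ", 1)[0] if name else ""
        found.add(SCRIPT_TYPE_BY_UNICODE_PREFIX.get(prefix, "unknown"))
    return found


def looks_like(text, script_type):
    """True if every letter in `text` maps to exactly `script_type`."""
    return guess_script_types(text) == {script_type}


print(looks_like("dobré ráno", "alphabet"))
print(looks_like("גדולים או בינוניים", "abjad"))
```

A table keyed on Unicode name prefixes is coarse: it cannot tell languages apart and knows nothing about syllabaries or featural scripts, which is why a curated per-language inventory such as the one Alphabetic ships is the more reliable route.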

You can also use Alphabetic to strip from a given string every character that does not belong to one of the supported script types (abjads, abugidas, alphabets, etc.):

```python
ws.keep_only_script_characters("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")

# Result: 'jüste BADgood tösté XY ßÜ משהו действует'
```
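
For comparison, a rough standard-library approximation of this kind of filter is sketched below. It is an assumption-laden stand-in, not the library's code: `keep_letters_sketch` is a hypothetical name, and it keeps anything Python classifies as a letter or combining mark instead of consulting Alphabetic's script inventories.

```python
import unicodedata


def keep_letters_sketch(text, keep_spaces=True):
    """Drop every character that is neither a letter nor a combining mark.

    Hypothetical helper for illustration; spaces survive only if `keep_spaces`.
    """
    kept = []
    for ch in text:
        is_letter = ch.isalpha()
        # Combining marks (categories Mn, Mc, Me) carry decomposed accents.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_letter or is_mark or (keep_spaces and ch == " "):
            kept.append(ch)
    return "".join(kept)


cleaned = keep_letters_sketch("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")
print(cleaned)  # jüste BADgood tösté XY ßÜ משהו действует
```

The sketch agrees with the README result for this input, but it is stricter than the library elsewhere: the Demo notebook shows `keep_only_script_characters` keeping the "3" and the apostrophe in an Arapaho word, whereas a plain `isalpha` test would drop both.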

## Features
- Currently [151 languages](#Supported_Languages) and corresponding scripts are supported, with more to follow over time;

2 changes: 1 addition & 1 deletion alphabetic/__init__.py
@@ -1 +1 @@
from alphabetic.core import *
from alphabetic.core import *