Integrated a preprocessing function that aims to remove non-characters from a given string; Bugfixing; Refactoring; Added new test; Added more examples to Demo.ipynb; Updated README.md;
Halvani · Jun 12, 2024 · d8eef54
1 parent 177a6e9
commit d8eef54

Showing 5 changed files with 334 additions and 52 deletions.
diff --git a/Demo.ipynb b/Demo.ipynb
@@ -607,6 +607,197 @@
     "ws.is_featural(\"랑이와 곶\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Filtering (only for script type: *alphabet*) "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### List only lower/upper case characters"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "a b c d e f g h i j k l m n o p q r s t u v w x y z ß ä ö ü\n",
+      "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Ä Ö Ü\n"
+     ]
+    }
+   ],
+   "source": [
+    "german_alphabet_lower = ws.by_language(ws.Language.German, letter_case=ws.LetterCase.Lower)\n",
+    "ws.pretty_print(german_alphabet_lower)\n",
+    "\n",
+    "# Note that there is no capital letter for the \"ß\" in German, as it can never be at the beginning of a word.\n",
+    "german_alphabet_upper = ws.by_language(ws.Language.German, letter_case=ws.LetterCase.Upper)\n",
+    "ws.pretty_print(german_alphabet_upper)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Filter out digraphs (multigraphs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "A B C Ç D E Ë F G H I J K L M N O P Q R S T U V X Y Z a b c ç d e ë f g h i j k l m n o p q r s t u v x y z\n"
+     ]
+    }
+   ],
+   "source": [
+    "albanian_alphabet_no_multigraphs = ws.by_language(ws.Language.Albanian, \n",
+    "                                                  strip_multigraphs=True, \n",
+    "                                                  multigraphs_size=ws.MultigraphSize.Digraph)\n",
+    "\n",
+    "ws.pretty_print(albanian_alphabet_no_multigraphs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Filter out diacritics (acute, grave, circumflex, cedilla, etc.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z Æ æ Œ œ\n"
+     ]
+    }
+   ],
+   "source": [
+    "french_alphabet_no_diacritics = ws.by_language(ws.Language.French, strip_diacritics=True)\n",
+    "ws.pretty_print(french_alphabet_no_diacritics)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Remove non-characters (any character not belonging to those script types supported by Alphabetic)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Czyż to nie miłe\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(ws.keep_only_script_characters(\"§§..Czyż t+o nie miłe?\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "here it is\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(ws.keep_only_script_characters(\"546here!\"\"§$ //(it)\\\\746 is*#*~~!!\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "dirty string\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(ws.keep_only_script_characters(\"||><_d_i_r-t-y %$string-+++\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "niini'kokoh'u3ecoo3i\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Note that the “3” remains, as it is a valid letter in the Arapaho language\n",
+    "# https://www.omniglot.com/writing/arapaho.htm \n",
+    "print(ws.keep_only_script_characters(\"12456niini'kokoh'u3ecoo3i789!!\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "AllSpacesRemovedHere\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(ws.keep_only_script_characters(\"*A=l?l*Sp~aces   Rem`&``oved?} Here \", keep_spaces=False))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -700,7 +891,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 32,
+   "execution_count": 40,
    "metadata": {},
    "outputs": [
     {
@@ -724,7 +915,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 33,
+   "execution_count": 41,
    "metadata": {},
    "outputs": [
     {
@@ -766,7 +957,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 34,
+   "execution_count": 42,
    "metadata": {},
    "outputs": [
     {
@@ -822,7 +1013,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 36,
+   "execution_count": 43,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -831,7 +1022,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 37,
+   "execution_count": 44,
    "metadata": {},
    "outputs": [
     {

diff --git a/README.md b/README.md
@@ -117,6 +117,9 @@ ws.pretty_print(ws.by_language(ws.Language.Chinese_Simplified))
 ```
 Another important use case is to check whether a given sequence of characters represents a specific script of a writing system. This can be achieved as follows:
 ```python
+ws.is_abjad("גדולים או בינוניים") # True
+ws.is_alphabet("גדולים או בינוניים") # False
+
 ws.is_alphabet("dobré ráno") # True
 ws.is_abjad("dobré ráno") # False
 
@@ -133,6 +136,14 @@ ws.is_alphabet("დილა მშვიდობისა") # True
 ws.is_abjad("დილა მშვიდობისა") # False
 ```
 
+Furthermore, you can also use Alphabetic to remove all characters from a given string that do not occur within the supported script types (abjads, abugidas, alphabets, etc.):  
+
+```python
+ws.keep_only_script_characters("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")
+
+# Result: 'jüste BADgood tösté XY ßÜ משהו действует'
+```
+
 ## Features
 - Currently [151 languages](#Supported_Languages) and corresponding scripts are supported, with more to follow over time;
 

diff --git a/alphabetic/__init__.py b/alphabetic/__init__.py
@@ -1 +1 @@
-from alphabetic.core import *
+from alphabetic.core import *
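
Note on the new preprocessing step: the diff shows only how `ws.keep_only_script_characters` is called, not how it is implemented. The sketch below is therefore not Alphabetic's code. It is a rough stand-in that approximates the behavior in the Demo.ipynb and README.md examples, assuming that "script character" can be approximated by Python's `str.isalpha()`. The function name `keep_letters_only` and the `extra_allowed` parameter are hypothetical and exist only for illustration.

```python
def keep_letters_only(text: str,
                      keep_spaces: bool = True,
                      extra_allowed: frozenset = frozenset()) -> str:
    """Hypothetical stand-in for ws.keep_only_script_characters.

    Assumption: a character counts as a script character if str.isalpha()
    is true for it, or if it is listed in extra_allowed. The real library
    presumably checks against its own per-script character sets instead.
    """
    kept = []
    for ch in text:
        if ch.isalpha() or ch in extra_allowed:
            kept.append(ch)   # letter, or explicitly allowed symbol
        elif keep_spaces and ch == " ":
            kept.append(ch)   # plain spaces survive only when requested
    return "".join(kept)


print(keep_letters_only("§§..Czyż t+o nie miłe?"))
# Czyż to nie miłe

print(keep_letters_only("*A=l?l*Sp~aces   Rem`&``oved?} Here ", keep_spaces=False))
# AllSpacesRemovedHere

# Arapaho uses "3" and "'" as letters, so they must be allowed explicitly here.
print(keep_letters_only("12456niini'kokoh'u3ecoo3i789!!", extra_allowed=frozenset("3'")))
# niini'kokoh'u3ecoo3i
```

The Arapaho case shows why a generic letter test is probably not enough on its own: without `extra_allowed`, the "3" and the apostrophe would be dropped, whereas the library output in the notebook keeps them.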