# Spellchecker-Utility for testing the generated TXTs

These code blocks can be used to determine the following characteristics of a .txt file:

1. Spelling and grammar mistakes combined (count and %-share)
2. Vocabulary size (number of different words) for all words and for German words only
3. Vocabulary frequency (histogram of used words) for all words and for German words only



To use this script, simply upload a .txt file, specify its name in the "filename" variable and run the notebook.
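
For a quick dry run without uploading anything, a sample file can be created directly in the runtime. The sketch below writes the one-sentence sample text that appears in this notebook's outputs to `Test.txt`. The helper name `write_sample` is hypothetical and not part of the notebook.

```python
from pathlib import Path

# Sample sentence taken from the preview output further down in this notebook
SAMPLE_TEXT = "Hallo hallo, this Datei enthält two englische Wörter."


def write_sample(path="Test.txt"):
    """Write the sample sentence to `path` as UTF-8 and return the path (hypothetical helper)."""
    target = Path(path)
    target.write_text(SAMPLE_TEXT, encoding="utf-8")
    return target


sample_path = write_sample()
print(sample_path.read_text(encoding="utf-8"))
```

With this file in place, the default `filename = 'Test.txt'` in the configuration cell works unchanged.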

*Thanks to "Marco Polo" for aiding us with those python-libraries :)*

**1. Install all dependencies**

In [1]:
# Dependencies

!pip install language_tool_python
!pip install pyspellchecker
!apt install build-essential python3-dev libhunspell-dev
!pip install hunspell
!pip install nltk



The command "apt" is either misspelled or
could not be found.


Collecting hunspell
  Using cached hunspell-0.5.5.tar.gz (34 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: hunspell
  Building wheel for hunspell (setup.py): started
  Building wheel for hunspell (setup.py): finished with status 'error'
  Running setup.py clean for hunspell
Failed to build hunspell


  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [5 lines of output]
  running bdist_wheel
  running build
  running build_ext
  building 'hunspell' extension
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for hunspell
ERROR: Could not build wheels for hunspell, which is required to install pyproject.toml-based projects


**2. Import all needed libraries**

In [2]:
import re
import matplotlib.pyplot as plt
import language_tool_python
from spellchecker import SpellChecker
import hunspell
import nltk

ModuleNotFoundError: No module named 'hunspell'

**3. Run the preconfiguration**
- Please specify the name of the .txt file to check in this cell
- If you want to change the preview-sizes, you can change them here as well
- Please don't forget to upload the files needed for hunspell into the runtime, as stated in the comment below
- If you, for whatever reason, need to change the parameters of the spellchecking libraries, please use this cell to do so

In [None]:
# Configuration variables -> Change things here!

tool = language_tool_python.LanguageTool('de-DE', config={ 'cacheSize': 1000, 'pipelineCaching': True, 'maxSpellingSuggestions': 1 }) # LanguageTool Setup
spell = SpellChecker(language='de') # PySpellChecker Setup

# For HunSpell you will need files from here: https://github.com/elastic/hunspell/tree/master/dicts/de_DE
d = hunspell.HunSpell("de_DE.dic", "de_DE.aff") # Upload these two files from the provided GitHub URL into the instance!

# Initialization of nltk
nltk.download('words')
eng_words = set(nltk.corpus.words.words()) # A set makes the "in" lookups in step 7 fast

# Misc parameters
text_preview_len = 250 # Length of the .txt preview
vocab_hist_preview = 10 # Length of the vocabulary preview and german word preview
filename = 'Test.txt' # Name of the file to check

Downloading LanguageTool 5.7: 100%|██████████| 225M/225M [00:04<00:00, 45.3MB/s]
INFO:language_tool_python.download_lt:Unzipping /tmp/tmpcqtd0k9e.zip to /root/.cache/language_tool_python.
INFO:language_tool_python.download_lt:Downloaded https://www.languagetool.org/download/LanguageTool-5.7.zip to /root/.cache/language_tool_python.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
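
Since the HunSpell setup depends on two uploaded dictionary files, it can help to check for them before running the configuration cell. The following is a minimal sketch using only the standard library. The function name `missing_dictionary_files` is hypothetical, and the file names are the ones from the comment in the cell above.

```python
from pathlib import Path


def missing_dictionary_files(directory=".", names=("de_DE.dic", "de_DE.aff")):
    """Return the required dictionary file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


missing = missing_dictionary_files()
if missing:
    print("Please upload these files first:", ", ".join(missing))
else:
    print("All HunSpell dictionary files found.")
```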


**4. Read in the .txt file here**
- There will be some stats and a short preview in the output

In [None]:
# Reading the textfile and previewing the first n characters

with open(filename, 'r', encoding='utf-8') as f:
  text = f.read()

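# Strip everything except letters (including German umlauts and ß)
# and drop tokens that end up empty, e.g. a lone "-" or a number
words = [re.sub(r'[^a-zA-ZßäöüÄÖÜ]', '', word) for word in text.split()]
words = [word for word in words if word]

print("Length of file (words): ", len(words))
print("Length of file (chars): ", len(text), "\n")
print(text[:text_preview_len])

Length of file (words):  8
Length of file (chars):  53 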

Hallo hallo, this Datei enthält two englische Wörter.
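
To make the cleaning step easier to follow, here is a small standalone sketch that applies the same regular expression to the sample sentence. The function name `tokenize` is an assumption for illustration. It also drops tokens that become empty after cleaning, such as a lone dash or a number.

```python
import re

# Same character class as in the notebook: keep only letters, umlauts and ß
NON_LETTERS = re.compile(r"[^a-zA-ZßäöüÄÖÜ]")


def tokenize(text):
    """Split on whitespace, strip non-letter characters, drop empty tokens."""
    cleaned = (NON_LETTERS.sub("", token) for token in text.split())
    return [token for token in cleaned if token]


sample = "Hallo hallo, this Datei enthält two englische Wörter."
print(tokenize(sample))           # punctuation such as "," and "." is removed
print(tokenize("Ende - 2024 !"))  # only "Ende" survives the cleaning
```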


**5. Check the .txt files with different spellchecking-tools**
- In the following cells you will find code for LanguageTool, PySpellChecker and HunSpell
- Three tools are used because each library relies on its own word list and checking logic; taken together, they hopefully give a reasonably reliable result

In [None]:
#Check with LanguageTool

matches = tool.check(text)
print("LanguageTool analysis:")
print("Number of spelling mistakes: ", len(matches))
print("Error rate: ", (len(matches) / len(words)))

LanguageTool analysis:
Number of spelling mistakes:  3
Error rate:  0.375
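
The cells in this step print the error rate as a fraction. To report it as a %-share instead, a small helper along the following lines could be used. The name `error_share` is hypothetical, and returning 0.0 for an empty text is an assumption made here to avoid a division by zero.

```python
def error_share(n_errors, n_words):
    """Return the error rate as a percentage; 0.0 if the text has no words."""
    if n_words == 0:
        return 0.0
    return 100.0 * n_errors / n_words


# 3 matches in 8 words, as in the LanguageTool output above
print(f"Error rate: {error_share(3, 8):.1f} %")  # prints "Error rate: 37.5 %"
```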


In [None]:
# Check with PySpellChecker
misspelled = spell.unknown(words)
print("PySpellChecker analysis:")
print("Misspelled Words: ", len(misspelled))
print("Error Rate: ", (len(misspelled) / len(words)))

PySpellChecker analysis:
Misspelled Words:  2
Error Rate:  0.25


In [None]:
# Check with Hunspell

errors = []
for word in words:
    if not d.spell(word):
        errors.append(word)

print("HunSpell analysis:")
print("Misspelled Words: ", len(errors))
print("Error Rate: ", (len(errors) / len(words)))

HunSpell analysis:
Misspelled Words:  2
Error Rate:  0.25
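
One possible way to combine the tools is to keep only words that several checkers agree on. The sketch below is not part of the original notebook. It takes the flagged words of each checker (for example `misspelled` and `errors` from the cells above) and returns those flagged at least `min_votes` times. Words are lowercased first so that results from different checkers are comparable. The example word lists are hypothetical, since the notebook only prints counts.

```python
from collections import Counter


def flagged_by_at_least(results, min_votes=2):
    """Return the lowercased words flagged by at least `min_votes` checkers, sorted."""
    votes = Counter()
    for flagged in results:
        # Count each word at most once per checker
        votes.update({word.lower() for word in flagged})
    return sorted(word for word, n in votes.items() if n >= min_votes)


# Hypothetical results; the notebook does not show which words were flagged
pyspell_flagged = {"this", "two"}
hunspell_flagged = ["This", "two", "Datei"]

print(flagged_by_at_least([pyspell_flagged, hunspell_flagged]))  # ['this', 'two']
```

Raising `min_votes` makes the combined result stricter, while `min_votes=1` yields the union of all flagged words.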


**6. Gather information about the vocabulary**
- The following cell analyzes the size of the vocabulary and the frequency of each word
- This code considers all words, German and English

In [None]:
# Build vocabulary and print size

words_v = [word.lower() for word in words]
vocabulary_dict = dict.fromkeys(words_v)
vocabulary = list(vocabulary_dict)
print("Vocabulary size:", len(vocabulary))
print("\nThe " + str(vocab_hist_preview) + " most used words:")

# Build histogram from vocabulary and preview the n most used words

vocab_hist = []
for word in set(words_v):
  count = words_v.count(word)
  elem = (word, count)
  vocab_hist.append(elem)

vocab_hist.sort(key=lambda x: x[1], reverse=True)

for word, count in vocab_hist[:vocab_hist_preview]:
  print(f"{word}: {count}")

Vocabulary size: 7

The 10 most used words:
hallo: 2
two: 1
datei: 1
englische: 1
this: 1
enthält: 1
wörter: 1
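
For larger files, calling `list.count` once per vocabulary entry gets slow, because every call scans the whole word list. As an alternative sketch, `collections.Counter` from the standard library builds the same histogram in a single pass. The function name `vocabulary_histogram` is an assumption for illustration. Words with the same count may appear in a different order than in the output above.

```python
from collections import Counter


def vocabulary_histogram(words, top_n=10):
    """Return (vocabulary size, list of the `top_n` most frequent lowercased words)."""
    counts = Counter(word.lower() for word in words)
    return len(counts), counts.most_common(top_n)


# The cleaned words of the sample sentence used in this notebook
sample_words = ["Hallo", "hallo", "this", "Datei", "enthält", "two", "englische", "Wörter"]

size, top = vocabulary_histogram(sample_words)
print("Vocabulary size:", size)
for word, count in top:
    print(f"{word}: {count}")
```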


**7. Gather information about the german vocabulary**

- Now we will analyze only the German words in the .txt file
- To extract them, an English word list from nltk is used: every word that is not in this list is treated as German
- This cell outputs the German words in the .txt file as well as the German vocabulary characteristics, as in step 6

In [None]:
# German words and german vocabulary

ger_words = [] # for building the vocabulary later
for word in words:
  if word not in eng_words:
    ger_words.append(word)

ger_words_tr = [] # for the correct text output (inefficient, but whatever :)
for voc in vocabulary:
  if voc not in eng_words:
    ger_words_tr.append(voc)

print("Number of german words, according to nltk:", len(ger_words))
print("German word rate:", (len(ger_words) / len(words)))
print("\nThe " + str(vocab_hist_preview) + " first german words:")
for entry in ger_words_tr[:vocab_hist_preview]:
  print(entry)

print("\n")

# Build german vocabulary and print size

words_ger = [word.lower() for word in ger_words]
vocabulary_dict_ger = dict.fromkeys(words_ger)
vocabulary_ger = list(vocabulary_dict_ger)
print("German vocabulary size:", len(vocabulary_ger))
print("\nThe " + str(vocab_hist_preview) + " most used german words:")

# Build german histogram from vocabulary and preview the n most used words

vocab_hist_ger = []
for word in set(words_ger):
  count = words_ger.count(word)
  elem = (word, count)
  vocab_hist_ger.append(elem)

vocab_hist_ger.sort(key=lambda x: x[1], reverse=True)

for word, count in vocab_hist_ger[:vocab_hist_preview]:
  print(f"{word}: {count}")

Number of german words, according to nltk: 6
German word rate: 0.75

The 10 first german words:
hallo
datei
enthält
englische
wörter


German vocabulary size: 5

The 10 most used german words:
hallo: 2
datei: 1
englische: 1
enthält: 1
wörter: 1
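
The language filter can also be written as one reusable function. The sketch below is an illustration under assumptions, not the notebook's code. It compares lowercased words, so a capitalized English word at the start of a sentence is still recognized, and it uses a set for fast lookups. The name `split_by_language` and the tiny `english_stub` list are hypothetical; in the notebook, `eng_words` from nltk would take the place of the stub.

```python
def split_by_language(words, english_words):
    """Split `words` into (german, english) by lookup in an English word list."""
    english = {word.lower() for word in english_words}
    german, other = [], []
    for word in words:
        if word.lower() in english:
            other.append(word)
        else:
            german.append(word)
    return german, other


# Hypothetical stand-in for nltk.corpus.words.words()
english_stub = ["this", "two", "file"]
sample_words = ["Hallo", "hallo", "this", "Datei", "enthält", "two", "englische", "Wörter"]

german, english = split_by_language(sample_words, english_stub)
print("German words:", german)
print("German word rate:", len(german) / len(sample_words))
```

Keep in mind that any word-list approach counts words spelled identically in both languages, such as "Hand", as English.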
