# Spell Correction Evaluation

This repository evaluates and compares two spell correction tools on the dataset below, using accuracy as the metric: the autocorrect library and a corrector that we implemented from scratch using **wordnet** and **nltk**.

## Dataset
https://www.kaggle.com/datasets/bittlingmayer/spelling

We have two spell correction tools:
1. A spell corrector that we implemented using **wordnet** and **nltk**
2. The autocorrect library

### Load Dataset

In [108]:
import pandas as pd

df = pd.read_csv("spell_corpus/aspell.txt", header=None, sep=":", names=["correct", "incorrect"])
df.head()

| | correct | incorrect |
|---|---|---|
| 0 | nevada | nevade |
| 1 | ability | abilitey |
| 2 | about | abouy |
| 3 | absorption | absorbtion |
| 4 | accidentally | accidently |


In [109]:
print(df.shape)  # 450 rows are available for evaluation

(450, 2)


In [110]:
correct_list = list(df["correct"])
incorrect_list = list(df["incorrect"])
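
Some rows in this file appear to list several misspellings for one correct word, separated by spaces (the sample output further down shows `accomodate acommadate` for `accommodate`), and the text after the colon keeps a leading space. The sketch below is one possible way to expand such rows into one pair per misspelling before building the lists. `parse_pairs` is a hypothetical helper, not part of the original notebook, and it assumes every line has the form `correct: misspelling [misspelling ...]`.

```python
def parse_pairs(lines):
    """Return (correct, incorrect) pairs, one per misspelling.

    Assumes each line looks like 'correct: misspelling [misspelling ...]'.
    """
    pairs = []
    for line in lines:
        correct, sep, misspellings = line.partition(":")
        if not sep:
            continue  # skip lines that contain no colon
        # split() with no argument also drops the leading space after the colon
        for misspelling in misspellings.split():
            pairs.append((correct.strip(), misspelling))
    return pairs


sample = ["nevada: nevade", "accommodate: accomodate acommadate"]
print(parse_pairs(sample))
```

With the real file, the lines could come from `open("spell_corpus/aspell.txt")`. Expanding rows this way changes the number of samples, so the resulting accuracy would not be directly comparable with the figures reported in this README.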

## 1. Spell corrector with wordnet, nltk, and edit distance

In [111]:
import nltk
from nltk.metrics.distance import edit_distance

In [112]:
nltk.download('wordnet')

from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\AsusIran\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Example

In [113]:
incorrect_words = ['happpy', 'azmaing', 'intelliengt']
  
# For each misspelled word, print the WordNet word that starts with the
# same letter and has the smallest edit distance to it
for word in incorrect_words:
    temp = [(edit_distance(word, w), w) for w in wordnet.words() if w[0] == word[0]]
    print(min(temp, key=lambda val: val[0])[1])

happy
amazing
intelligent
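
For readers unfamiliar with the metric: with its default arguments, `edit_distance` computes the Levenshtein distance, which is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that idea for illustration only. The function name `levenshtein` is our own, and this is not nltk's implementation.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the current prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


for wrong, right in [("happpy", "happy"), ("azmaing", "amazing")]:
    print(wrong, "->", right, ":", levenshtein(wrong, right))
```

The corrector above picks, among the WordNet words sharing the first letter, the one with the smallest such distance. When several candidates tie, the first one encountered wins.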


### Evaluation by Accuracy

In [114]:
# For each misspelled word, predict the WordNet word that starts with the
# same letter and has the smallest edit distance to it
spell_predicted_list = []
for word in incorrect_list:
    word = word.strip()
    temp = [(edit_distance(word, w), w) for w in wordnet.words() if w[0] == word[0]]
    spell_predicted = min(temp, key=lambda val: val[0])[1]
    spell_predicted_list.append(spell_predicted)

- Spell correction time for **450 words**: 4m 34.2s
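
Much of this time likely goes into scanning the whole vocabulary and computing an edit distance for every candidate of every input word. The sketch below shows one way the lookup might be restructured: group the vocabulary by first letter once, then skip candidates whose length differs from the input by at least the best distance found so far, since the edit distance can never be smaller than the length difference. `build_first_letter_index` and `correct` are hypothetical names, and the speed-up was not measured for this README.

```python
from collections import defaultdict


def build_first_letter_index(vocabulary):
    """Group words by first character so each lookup scans a single bucket."""
    index = defaultdict(list)
    for w in vocabulary:
        if w:
            index[w[0]].append(w)
    return index


def correct(word, index, distance):
    """Return the closest word sharing the first letter, or `word` if none exists."""
    best_word, best_dist = word, None
    for w in index.get(word[:1], []):
        # The edit distance is at least the length difference, so a candidate
        # this far off in length cannot beat the current best.
        if best_dist is not None and abs(len(w) - len(word)) >= best_dist:
            continue
        d = distance(word, w)
        if best_dist is None or d < best_dist:
            best_word, best_dist = w, d
    return best_word
```

In the notebook's setting this could be used as `index = build_first_letter_index(wordnet.words())` once, followed by `correct(word.strip(), index, edit_distance)` for each word. Unlike the loop above, this version returns the input unchanged when no candidate shares its first letter instead of raising an error.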

### Sample corrections (25 samples)

In [125]:
for i in range(0, 25):
    print("Incorrect:", incorrect_list[i], "| Predicted: ", spell_predicted_list[i], "| Actual Spell: ", correct_list[i])

Incorrect:  nevade | Predicted:  nevada | Actual Spell:  nevada
Incorrect:  abilitey | Predicted:  ability | Actual Spell:  ability
Incorrect:  abouy | Predicted:  about | Actual Spell:  about
Incorrect:  absorbtion | Predicted:  absorption | Actual Spell:  absorption
Incorrect:  accidently | Predicted:  accidental | Actual Spell:  accidentally
Incorrect:  accomodate acommadate | Predicted:  accommodational | Actual Spell:  accommodate
Incorrect:  acord | Predicted:  acold | Actual Spell:  accord
Incorrect:  aquantance | Predicted:  acquaintance | Actual Spell:  acquaintance
Incorrect:  equire | Predicted:  equine | Actual Spell:  acquire
Incorrect:  adultry | Predicted:  adultery | Actual Spell:  adultery
Incorrect:  aggresive | Predicted:  aggressive | Actual Spell:  aggressive
Incorrect:  alchohol | Predicted:  alcohol | Actual Spell:  alcohol
Incorrect:  alchoholic | Predicted:  alcoholic | Actual Spell:  alcoholic
Incorrect:  allieve | Predicted:  alive | Actual Spell:  alive
Inco

### Calculate accuracy

In [126]:
true_prediction = 0
for i in range(len(correct_list)):
    if spell_predicted_list[i] == correct_list[i]:
        true_prediction += 1

print("Number of Samples: ", len(correct_list))
print("Number of True Spell Correction: ", true_prediction)
print("Accuracy: ", (true_prediction/len(correct_list))*100, "%")

Number of Samples:  450
Number of True Spell Correction:  173
Accuracy:  38.44444444444444 %
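
The metric used here is exact-match accuracy: the number of predictions identical to the expected word, divided by the number of samples, so 173 / 450 ≈ 38.44%. A compact sketch of the same computation as a reusable function follows. The name `accuracy` is our own, and stripping surrounding whitespace before comparing is an added assumption that the loop above does not make.

```python
def accuracy(predicted, expected):
    """Percentage of positions where the prediction matches the expected word exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    # Assumption: surrounding whitespace is ignored when comparing
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return 100 * hits / len(expected)


print(accuracy(["nevada", "accidental"], ["nevada", "accidentally"]))
```

Because the comparison is exact, a near miss such as `accidental` for `accidentally` counts as a full error, as does any row whose input contains more than one misspelling.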


## 2. autocorrect library

In [127]:
! pip install autocorrect



In [128]:
from autocorrect import Speller

speller = Speller()

spell_predicted_list = []
for word in incorrect_list:
    clean = speller(word.strip())
    spell_predicted_list.append(clean)

- Spell correction time for **450 words**: 19.1s

### Sample corrections (25 samples)

In [129]:
for i in range(0, 25):
    print("Incorrect:", incorrect_list[i], "| Predicted: ", spell_predicted_list[i], "| Actual Spell: ", correct_list[i])

Incorrect:  nevade | Predicted:  evade | Actual Spell:  nevada
Incorrect:  abilitey | Predicted:  ability | Actual Spell:  ability
Incorrect:  abouy | Predicted:  about | Actual Spell:  about
Incorrect:  absorbtion | Predicted:  absorption | Actual Spell:  absorption
Incorrect:  accidently | Predicted:  accident | Actual Spell:  accidentally
Incorrect:  accomodate acommadate | Predicted:  accommodate accommodate | Actual Spell:  accommodate
Incorrect:  acord | Predicted:  cord | Actual Spell:  accord
Incorrect:  aquantance | Predicted:  acquaintance | Actual Spell:  acquaintance
Incorrect:  equire | Predicted:  require | Actual Spell:  acquire
Incorrect:  adultry | Predicted:  adultery | Actual Spell:  adultery
Incorrect:  aggresive | Predicted:  aggressive | Actual Spell:  aggressive
Incorrect:  alchohol | Predicted:  alcohol | Actual Spell:  alcohol
Incorrect:  alchoholic | Predicted:  alcoholic | Actual Spell:  alcoholic
Incorrect:  allieve | Predicted:  believe | Actual Spell:  ali

### Calculate accuracy

In [130]:
true_prediction = 0
for i in range(len(correct_list)):
    if spell_predicted_list[i] == correct_list[i]:
        true_prediction += 1

print("Number of Samples: ", len(correct_list))
print("Number of True Spell Correction: ", true_prediction)
print("Accuracy: ", (true_prediction/len(correct_list))*100, "%")

Number of Samples:  450
Number of True Spell Correction:  182
Accuracy:  40.44444444444444 %
