# Spell Checkers Comparison Report

## Overview

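This report compares three spell-checking tools (PySpellChecker, TextBlobChecker, and AutocorrectChecker) and evaluates how accurately and how quickly they detect and correct misspelled words.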

## Dataset Selection

### Birkbeck Dataset
The Birkbeck dataset was chosen for this research due to its comprehensive nature and the following advantages:

- **Diversity of Errors**: The dataset contains a wide range of misspelled words, allowing for a thorough evaluation of spell-checking algorithms.
- **Real-World Relevance**: The errors in the dataset are derived from actual writing samples, primarily from schoolchildren, university students, and adult literacy learners. This ensures that the misspelled words represent common real-world mistakes, making the dataset particularly valuable for training and evaluating spell-checkers in practical scenarios.

### Data Usage
For practical reasons, only **3000 misspelled words** were selected from the dataset so that the evaluation could be completed before the project deadline. This subset keeps processing time manageable while still providing meaningful insight into the performance of the spell checkers.

## Error Type Identification

To support a finer-grained evaluation, a script was developed to identify the type of error in each misspelled word. The script distinguishes the following error types:

- **Insertion**: An extra character is added.
- **Deletion**: A character is omitted.
- **Transposition**: Two adjacent characters are swapped.
- **Substitution**: A character is replaced with another.

This classification enables a detailed analysis of how well different spell checkers correct various types of errors.
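
The four categories above can be sketched as a small classifier. This is a minimal illustration, not the script used in this project: the function name `classify_error` and the fallback label `other` (for misspellings that differ from the correct word by more than one edit) are assumptions introduced here.

```python
def classify_error(correct: str, misspelling: str) -> str:
    """Label a single-edit misspelling; anything else is reported as 'other'."""
    if correct == misspelling:
        return "none"

    len_diff = len(misspelling) - len(correct)

    if len_diff == 1:
        # Insertion: dropping one character from the misspelling gives the correct word.
        for i in range(len(misspelling)):
            if misspelling[:i] + misspelling[i + 1:] == correct:
                return "insertion"
    elif len_diff == -1:
        # Deletion: dropping one character from the correct word gives the misspelling.
        for i in range(len(correct)):
            if correct[:i] + correct[i + 1:] == misspelling:
                return "deletion"
    elif len_diff == 0:
        # Positions where the two equal-length words disagree.
        diffs = [i for i, (a, b) in enumerate(zip(correct, misspelling)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and correct[diffs[0]] == misspelling[diffs[1]]
            and correct[diffs[1]] == misspelling[diffs[0]]
        ):
            return "transposition"

    # Hypothetical fallback: more than one edit separates the two words.
    return "other"
```

Under this sketch, `classify_error("test", "tset")` yields `transposition` and `classify_error("word", "wrd")` yields `deletion`. Multi-edit misspellings, which are common in learner writing, fall into `other` rather than being forced into one of the four categories.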

## Data Formatting

### `birkbeck_misspell.csv`
The misspelled words were organized into a CSV file (`birkbeck_misspell.csv`) with the following format:

| correct_word | misspelling | error_type    |
|--------------|-------------|---------------|
| example      | exmple      | deletion      |
| test         | tset        | transposition |
| word         | wrd         | deletion      |

Each row contains:
- **correct_word**: The correctly spelled word.
- **misspelling**: The misspelled version of the word.
- **error_type**: The type of error identified.
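
A file in this layout can be read with Python's standard `csv` module. The sketch below assumes a comma-delimited file whose header row uses the three column names listed above; the helper name `load_pairs` and the inline sample data are illustrative only.

```python
import csv
import io

# Inline stand-in for birkbeck_misspell.csv (assumed: comma-delimited, with header row).
SAMPLE_CSV = """correct_word,misspelling,error_type
test,tset,transposition
word,wrd,deletion
"""


def load_pairs(handle):
    """Return a list of (correct_word, misspelling, error_type) tuples."""
    reader = csv.DictReader(handle)
    return [
        (row["correct_word"], row["misspelling"], row["error_type"])
        for row in reader
    ]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))

# One correct word per line, the layout described for the companion text file.
correct_words = [correct for correct, _, _ in pairs]
correct_txt = "\n".join(correct_words) + "\n"
```

For the real file, the same helper could be called on `open("birkbeck_misspell.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.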

### `birkbeck_correct.txt`
A separate file, `birkbeck_correct.txt`, contains 3000 correct words, one per line.


# Analysis of Spell Checker Results

Three spell checkers (**PySpellChecker**, **TextBlobChecker**, and **AutocorrectChecker**) were evaluated on six metrics: accuracy, precision, recall, F1 score, fix rate, and speed. The results for each tool are analyzed below, with their strengths and weaknesses.
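
The report does not spell out how these metrics were computed, so the following is one plausible reading rather than the project's actual evaluation code. It assumes that detection is scored as binary classification over the misspelled words (positives) and the correct words (negatives), that a word counts as flagged when the checker returns something different from its input, and that fix rate is the share of misspellings for which the checker returns the intended word. The reported accuracy, precision, and recall values are numerically consistent with this reading on 3000 misspelled and 3000 correct words. The names `Metrics`, `evaluate`, and `toy_checker` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float
    fix_rate: float


def evaluate(checker, misspelled_pairs, correct_words):
    """Score a checker, where checker(word) returns its suggested spelling.

    Assumption: returning the word unchanged means "not flagged".
    """
    tp = fn = fixed = 0
    for correct, misspelling in misspelled_pairs:
        suggestion = checker(misspelling)
        if suggestion != misspelling:
            tp += 1  # misspelling was flagged
        else:
            fn += 1  # misspelling was missed
        if suggestion == correct:
            fixed += 1  # misspelling was repaired to the intended word

    fp = sum(1 for word in correct_words if checker(word) != word)
    tn = len(correct_words) - fp

    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return Metrics(
        accuracy=(tp + tn) / total if total else 0.0,
        precision=precision,
        recall=recall,
        f1=f1,
        fix_rate=fixed / len(misspelled_pairs) if misspelled_pairs else 0.0,
    )


def toy_checker(word):
    """Hypothetical stand-in for a real spell checker."""
    fixes = {"tset": "test", "wrd": "ward"}
    return fixes.get(word, word)


metrics = evaluate(
    toy_checker,
    [("test", "tset"), ("word", "wrd"), ("example", "exmple")],
    ["test", "word", "example"],
)
```

In the toy run, two of the three misspellings are flagged but only one is repaired to the intended word, which shows how a checker can score well on recall while having a much lower fix rate.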

## 1. PySpellChecker

### Results:
- **Accuracy**: 0.9423
- **Precision**: 0.9477
- **Recall**: 0.9363
- **F1 Score**: 0.9420
- **Fix Rate**: 0.3393
- **Words per second**: 6.52

### Strengths:
- **High Accuracy and Precision**: PySpellChecker achieved the highest accuracy (94.23%) and precision (94.77%) of the three tools, meaning it rarely flags a correctly spelled word as an error.
- **Strong Recall**: With a recall of 93.63%, it detects the large majority of misspellings, making it reliable for comprehensive error detection.

### Weaknesses:
- **Moderate Fix Rate**: The fix rate of 33.93% shows that, although it detects most errors, it corrects only about one-third of the misspelled words, leaving room for improvement in its correction capabilities.
- **Processing Speed**: At 6.52 words per second it is the slowest of the three tools, roughly 3.5 times slower than AutocorrectChecker (22.68 words per second).

## 2. TextBlobChecker

### Results:
- **Accuracy**: 0.7300
- **Precision**: 0.8436
- **Recall**: 0.5647
- **F1 Score**: 0.6765
- **Fix Rate**: 0.0553
- **Words per second**: 6.73

### Strengths:
- **Good Precision**: With a precision of 84.36%, TextBlobChecker is effective at minimizing false positives when it does flag a misspelling.
- **Speed**: At 6.73 words per second, it is slightly faster than PySpellChecker (6.52 words per second).

### Weaknesses:
- **Low Accuracy and Recall**: An accuracy of 73% and a recall of 56.47% mean that this checker fails to flag roughly 44% of the misspelled words.
- **Very Low Fix Rate**: The fix rate of just 5.53% suggests that it struggles to correct identified errors effectively, making it less useful in practical applications.

## 3. AutocorrectChecker

### Results:
- **Accuracy**: 0.8043
- **Precision**: 0.9076
- **Recall**: 0.6777
- **F1 Score**: 0.7760
- **Fix Rate**: 0.3347
- **Words per second**: 22.68

### Strengths:
- **High Speed**: AutocorrectChecker stands out with a processing speed of 22.68 words per second, making it the fastest among the three tools.
- **Balanced Performance Metrics**: With an accuracy of 80.43% and a precision of 90.76%, it sits between the other two tools on both metrics, offering reasonable detection quality alongside its speed.

### Weaknesses:
- **Moderate Recall**: A recall of 67.77% indicates that, although its precision is high, it misses about a third of the misspellings.
- **Fix Rate Similar to PySpellChecker**: The fix rate (33.47%) is far better than TextBlobChecker's (5.53%) and close to PySpellChecker's, but it still leaves about two-thirds of the misspelled words uncorrected.
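
The words-per-second figures reported for the three tools could be measured along the following lines. This is a rough sketch under stated assumptions: the helper name `words_per_second` is hypothetical, the checker is treated as a plain callable taking one word, and the report does not say whether the original timing included setup costs such as loading dictionaries.

```python
import time


def words_per_second(checker, words):
    """Time checker(word) over all words and return the throughput."""
    start = time.perf_counter()
    for word in words:
        checker(word)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    return len(words) / elapsed if elapsed > 0 else float("inf")


# Trivial stand-in checker, used only to make the example runnable.
rate = words_per_second(str.lower, ["example"] * 1000)
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock. Repeating the measurement several times and averaging would reduce noise.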

# Conclusion

The comparative analysis of the three spell checkers reveals distinct strengths and weaknesses for each tool:

1. **PySpellChecker** excels in accuracy and precision but has a moderate fix rate, indicating effective identification but limited correction capabilities.
2. **TextBlobChecker**, while having good precision, suffers from low accuracy and recall, making it less reliable for comprehensive spell-checking tasks.
3. **AutocorrectChecker**, despite its high speed and balanced performance metrics, still needs improvements in recall to enhance its overall effectiveness.

All three spell checkers have their merits, and the right choice depends on the use case: PySpellChecker when detection accuracy matters most, AutocorrectChecker when speed is the priority, and either of these two over TextBlobChecker when correction capability is critical.