<a href="https://colab.research.google.com/github/Nuwantha97/Sinhala_spell_and_grammer_checker/blob/Notebooks/Spell_checker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mount drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Edit Distance (Levenshtein Distance)


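The Levenshtein library for Python computes edit distances and related string-similarity metrics:

- Levenshtein Distance: The minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another.
- Levenshtein Similarity: A measure of how similar two strings are, typically on a scale from 0 to 1.
- Other Metrics:
  - Ratio: A normalized similarity between 0 and 1. In this library, `Levenshtein.ratio` is based on the indel distance (insertions and deletions only, so a substitution costs 2) and is computed as 1 - distance / (len1 + len2), not 1 - distance / max length.
  - Hamming Distance: Number of positions where two strings of equal length differ.
  - Jaro-Winkler Similarity: A similarity metric that gives extra weight to a shared prefix, often used for short strings such as names.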
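
To make the definitions above concrete, here is a minimal pure-Python sketch of the same quantities. The function names (`edit_distance`, `indel_ratio`, `hamming`) are hypothetical and chosen for illustration; they are not the library's API, and the library's own implementation is much faster.

```python
def edit_distance(a: str, b: str, substitution_cost: int = 1) -> int:
    """Dynamic-programming edit distance, one row at a time.

    substitution_cost=1 gives the Levenshtein distance;
    substitution_cost=2 gives the indel distance (insert/delete only).
    """
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else substitution_cost
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized similarity: 1 - indel_distance / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 1 - edit_distance(a, b, substitution_cost=2) / total


def hamming(a: str, b: str) -> int:
    """Number of differing positions in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


print(edit_distance("kitten", "sitting"))  # 3
print(hamming("karolin", "kathrin"))       # 3
```

Python strings are sequences of Unicode code points, so for Sinhala text each base letter, vowel sign, and virama counts as a separate character.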

In [None]:
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.26.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.26.1 (from python-Levenshtein)
  Downloading levenshtein-0.26.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.26.1->python-Levenshtein)
  Downloading rapidfuzz-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading python_Levenshtein-0.26.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.26.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (162 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 162.6/162.6 kB 6.4 MB/s eta 0:00:00
Downloading rapidfuzz-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 31.7 MB/s eta 0:00:00
[?25hInstalling collected packages:

# Create Sinhala dictionary

## Split lines into words

In [27]:
# Open the input text file in read mode
with open('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/sinhala_full_word_list_2016-10-08.txt', 'r', encoding='utf-8') as infile:
    # Read all lines from the input file
    lines = infile.readlines()

# Create a list to store all words
words = []

# Loop through each line to extract words
for line in lines:
    # Split the line into words and extend the list
    words.extend(line.split())


## Remove duplicate words

In [28]:
unique_words = []
seen = set()
for word in words:
    if word not in seen:
        unique_words.append(word)
        seen.add(word)
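
The same order-preserving de-duplication can be written more compactly. This is a small sketch on a made-up sample list (`sample_words` is hypothetical); it relies on `dict` keeping insertion order, which Python guarantees from version 3.7 on.

```python
# Hypothetical sample; in the notebook this would be the `words` list
sample_words = ["සුභ", "අසුභ", "සුභ", "සුව", "අසුභ"]

# dict.fromkeys keeps the first occurrence of each key, in order
unique_sample = list(dict.fromkeys(sample_words))

print(unique_sample)
```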

## Remove non-Sinhala words

In [29]:
import re

# Sinhala Unicode character range
sinhala_pattern = re.compile(r'^[\u0D80-\u0DFF]+$')

# Filter Sinhala words
sinhala_words = [word for word in unique_words if sinhala_pattern.match(word)]
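
One thing to be aware of: the pattern above accepts only code points in the Sinhala block, so words written with a zero-width joiner (U+200D), as in some conjunct forms, are filtered out. The sketch below compares the strict pattern with a hypothetical relaxed one that also allows the zero-width joiner and non-joiner. Whether such words belong in the dictionary is a design choice, and the names `strict_pattern` and `joiner_pattern` are illustrative only.

```python
import re

# Same character range as the notebook's pattern
strict_pattern = re.compile(r'[\u0D80-\u0DFF]+')
# Assumption: also accept ZWNJ (U+200C) and ZWJ (U+200D) inside words
joiner_pattern = re.compile(r'[\u0D80-\u0DFF\u200C\u200D]+')

# "sri" spelled with a zero-width joiner between the virama and the next letter
sri = "\u0DC1\u0DCA\u200D\u0DBB\u0DD3"
samples = ["සුභ", "hello", "සුභ123", sri]

# fullmatch requires the whole string to match, so no ^...$ anchors are needed
strict_result = [bool(strict_pattern.fullmatch(w)) for w in samples]
joiner_result = [bool(joiner_pattern.fullmatch(w)) for w in samples]

print(strict_result)  # [True, False, False, False]
print(joiner_result)  # [True, False, False, True]
```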

In [30]:
# Open the output text file in write mode
with open('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/sinhala_dict1.txt', 'w', encoding='utf-8') as outfile:
    # Write each word on a new line
    for word in sinhala_words:
        outfile.write(word + '\n')

print("Words have been written to 'sinhala_dict1.txt' line by line.")

Words have been written to 'sinhala_dict1.txt' line by line.


# Spell check

In [33]:
# Function to load the dictionary from a text file
def load_dictionary(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        dictionary = [line.strip() for line in file]
    return dictionary
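
If the dictionary file contains blank lines, the loader above keeps them as empty strings, which then take part in every distance comparison. A possible variant that skips them is sketched here. `load_dictionary_clean` is a hypothetical name, and the temporary file only stands in for the real word list.

```python
import os
import tempfile


def load_dictionary_clean(file_path):
    """Load one word per line, dropping blank or whitespace-only lines."""
    with open(file_path, 'r', encoding='utf-8') as file:
        stripped = (line.strip() for line in file)
        return [word for word in stripped if word]


# Write a small throwaway file to try the loader on
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write("සුභ\n\nඅසුභ\n   \nසුව\n")
    tmp_path = tmp.name

loaded = load_dictionary_clean(tmp_path)
os.remove(tmp_path)

print(loaded)
```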

In [65]:
import Levenshtein

def spell_check(word, dictionary, top_n=3):
    # List to store words with their distances
    word_distances = []

    for correct_word in dictionary:
        # Calculate the Levenshtein distance between the word and dictionary word
        distance = Levenshtein.distance(word, correct_word)
        word_distances.append((correct_word, distance))

    # Sort the list by distance (ascending order)
    word_distances.sort(key=lambda x: x[1])

    # Return the top N closest words
    return word_distances[:top_n]

def check_sentence(sentence, sinhala_dictionary):
    words = sentence.split()  # Split the input sentence into words
    corrected_words = []  # List to store corrected words
    distances = []  # List to store Levenshtein distances for each word

    for word in words:
        # Get the top suggestion (closest word) and its distance
        top_words = spell_check(word, sinhala_dictionary, top_n=3)
        if top_words:  # Ensure there's at least one suggestion
            corrected_word, distance = top_words[0]  # Top suggestion
            corrected_words.append(corrected_word)  # Add corrected word
            distances.append(distance)  # Add the distance
        else:
            # If no suggestions, append the original word
            corrected_words.append(word)
            distances.append(None)  # No distance available

        print_suggestion(word, top_words)

    # Combine corrected words into a single sentence
    corrected_sentence = ' '.join(corrected_words)

    # Return values
    return sentence, corrected_sentence, distances

def print_suggestion(word, top_words):
    print(f"Suggestions for '{word}':")
    for i, (correct_word, distance) in enumerate(top_words, 1):
        print(f"{i}. {correct_word} (Distance: {distance})")
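
Two properties of `spell_check` are worth noting. It scores and sorts the whole dictionary for every word, and because the sort is stable, candidates with equal distance come out in dictionary order, so the top suggestion among ties depends on how the word list is ordered. The sketch below is one possible variant, not the notebook's method: it returns early for words already in the dictionary and uses `heapq.nsmallest` to avoid a full sort. `suggest`, `edit_distance`, and `toy_dictionary` are hypothetical names; with the library installed, `Levenshtein.distance` could be passed as `distance_fn` instead of the pure-Python helper.

```python
import heapq


def edit_distance(a, b):
    """Pure-Python Levenshtein distance, used so the sketch runs on its own."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,            # deletion
                               current[j - 1] + 1,         # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def suggest(word, dictionary, top_n=3, distance_fn=edit_distance):
    """Return up to top_n (candidate, distance) pairs, closest first."""
    # A word that is already in the dictionary needs no correction.
    # (Membership tests are faster if `dictionary` is a set.)
    if word in dictionary:
        return [(word, 0)]
    scored = ((candidate, distance_fn(word, candidate))
              for candidate in dictionary)
    # nsmallest keeps only the top_n best candidates instead of sorting all
    return heapq.nsmallest(top_n, scored, key=lambda pair: pair[1])


toy_dictionary = ["සුභ", "අසුභ", "සුව"]
print(suggest("සුභ", toy_dictionary))            # exact match, distance 0
print(suggest("සුභම", toy_dictionary, top_n=1))  # closest single candidate
```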

In [64]:
# Load the dictionary from the text file
sinhala_dictionary = load_dictionary('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/sinhala_dict1.txt')

# Input sentence
sentence = "සුභම සුභ රත්රිය"

# Perform spell check
sentence, corrected_sentence, distances = check_sentence(sentence, sinhala_dictionary)
print(f"\nInput Sentence: {sentence}")
print(f"Suggested Correction: {corrected_sentence}")
print(f"Levenshtein Distances: {distances}")

Suggestions for 'සුභම':
1. සුගම (Distance: 1)
2. සුභ (Distance: 1)
3. සුභග (Distance: 1)
Suggestions for 'සුභ':
1. සුභ (Distance: 0)
2. අසුභ (Distance: 1)
3. සුව (Distance: 1)
Suggestions for 'රත්රිය':
1. රත්නිය (Distance: 1)
2. ඇතිරිය (Distance: 2)
3. ඉතිරිය (Distance: 2)

Input Sentence: සුභම සුභ රත්රිය
Suggested Correction: සුගම සුභ රත්නිය
Levenshtein Distances: [1, 0, 1]
