<a href="https://colab.research.google.com/github/Nuwantha97/Sinhala_spell_and_grammer_checker/blob/Notebooks/Spell_checker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mount drive

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



Spelling Errors:
The letters of a word are not arranged as in its standard (dictionary) form.
- Example: "adres" instead of "address."

Grammar Errors:
The syntax, word forms, or sentence structure break the rules of the language, even when every word is spelled correctly.
- Example: "He going to school yesterday" instead of "He went to school yesterday."

# Edit Distance (Levenshtein Distance)


The `Levenshtein` library for Python computes edit distances and related string-similarity metrics:

- Levenshtein Distance: The minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.
- Levenshtein Similarity: Measures how similar two strings are, typically on a scale from 0 to 1.
- Other Metrics:
 - Ratio: A normalized similarity between 0 and 1. In this library it is computed as 1 - d/(len1 + len2), where d is an indel distance (insertions and deletions only, so a substitution counts as 2).
 - Hamming Distance: Number of positions where two strings of equal length differ.
 - Jaro-Winkler Similarity: A similarity metric that gives extra weight to a shared prefix, which makes it useful for short strings.
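
As a rough illustration of the Levenshtein and Hamming distances listed above, the sketch below implements both in plain Python so that it runs without the library. The function names are illustrative and are not the library's API. Distances are counted in Unicode code points, so a Sinhala vowel sign counts as its own symbol.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the items of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of differing positions; only defined for equal-length inputs."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs inputs of equal length")
    return sum(x != y for x, y in zip(a, b))


print(levenshtein("adres", "address"))   # 2 (two insertions)
print(levenshtein("ලයනවා", "ලියනවා"))    # 1 (one missing vowel sign)
print(hamming("karolin", "kathrin"))     # 3
```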

In [27]:
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.26.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.26.1 (from python-Levenshtein)
  Downloading levenshtein-0.26.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.26.1->python-Levenshtein)
  Downloading rapidfuzz-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading python_Levenshtein-0.26.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.26.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (162 kB)
Downloading rapidfuzz-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages:

# Create Sinhala dictionary (dataset 01)

## Split lines into words

In [None]:
# Open the input text file in read mode
with open('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_full_word_list_2016-10-08.txt', 'r', encoding='utf-8') as infile:
    # Read all lines from the input file
    lines = infile.readlines()

# Create a list to store all words
words = []

# Loop through each line to extract words
for line in lines:
    # Split the line into words and extend the list
    words.extend(line.split())


## Remove duplicate words

In [None]:
unique_words = []
seen = set()
for word in words:
    if word not in seen:
        unique_words.append(word)
        seen.add(word)
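
The same order-preserving deduplication can be sketched more compactly with `dict.fromkeys`, since dictionaries keep insertion order in Python 3.7 and later. The sample list is illustrative and is not taken from the real word list.

```python
# Illustrative sample; the notebook's `words` list comes from the dataset file.
words = ["මම", "පොත්", "මම", "අහස", "පොත්"]

# dict keys are unique and keep first-seen order, so this drops later repeats.
unique_words = list(dict.fromkeys(words))

print(unique_words)
```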

## Remove non-Sinhala words

In [None]:
import re

# Sinhala Unicode character range
sinhala_pattern = re.compile(r'^[\u0D80-\u0DFF]+$')

# Filter Sinhala words
sinhala_words = [word for word in unique_words if sinhala_pattern.match(word)]
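
One possible refinement, offered as an assumption rather than something this dataset is known to need: Sinhala text uses the zero-width joiner (U+200D) inside some conjunct forms, and that character lies outside the U+0D80 to U+0DFF block, so a strict block-only pattern rejects such words. The sketch below allows the joiner after the first Sinhala character. The names `SINHALA_WORD` and `is_sinhala_word` are hypothetical.

```python
import re

# Assumption: accept the zero-width joiner (U+200D) in addition to the
# Sinhala block, but require the word to start with a Sinhala character.
SINHALA_WORD = re.compile(r'[\u0D80-\u0DFF][\u0D80-\u0DFF\u200D]*')


def is_sinhala_word(word):
    # fullmatch requires the whole word to match, so no ^ or $ is needed.
    return SINHALA_WORD.fullmatch(word) is not None


sample = ["මම", "abc", "ශ්\u200dරී", "පොත්1"]
print([w for w in sample if is_sinhala_word(w)])
```

Whether to keep or strip the joiner before dictionary lookup is a separate design choice; the point here is only that the filter does not silently drop such words.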

In [None]:
import csv

# Open the output CSV file in write mode
with open('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict1.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)

    # Write each word as a new row in the CSV
    for word in sinhala_words:
        writer.writerow([word])

print("Words have been written to 'sinhala_dict1.csv' line by line.")


Words have been written to 'sinhala_dict1.csv' line by line.


# Dataset 02

In [None]:
import pandas as pd

# Load the .xlsx file
file_path = "/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/data-spell-checker.xlsx"  # Replace with your file's path
data = pd.read_excel(file_path)

# Filter the rows where label == 1
filtered_data = data[data['label'] == 1]

# Select only the words column
words_with_label_1 = filtered_data['word']

# Save the filtered words to a .csv file
output_file_path = "/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict2.csv"  # Replace with your desired output file name
words_with_label_1.to_csv(output_file_path, index=False, header=False)

print(f"Filtered words saved to {output_file_path}")


Filtered words saved to /content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict2.csv


# Spell check

In [60]:
# Function to load the dictionary from a text file
def load_dictionary(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        dictionary = [line.strip() for line in file]
    return dictionary

In [None]:
import Levenshtein

def spell_check(word, dictionary, top_n=3):
    # List to store words with their distances
    word_distances = []

    for correct_word in dictionary:
        # Calculate the Levenshtein distance between the word and dictionary word
        distance = Levenshtein.distance(word, correct_word)
        word_distances.append((correct_word, distance))

    # Sort the list by distance (ascending order)
    word_distances.sort(key=lambda x: x[1])

    # Return the top N closest words
    return word_distances[:top_n]

def check_sentence(sentence, sinhala_dictionary):
    words = sentence.split()  # Split the input sentence into words
    corrected_words = []  # List to store corrected words
    distances = []  # List to store Levenshtein distances for each word

    for word in words:
        # Get the top suggestion (closest word) and its distance
        top_words = spell_check(word, sinhala_dictionary, top_n=3)
        if top_words:  # Ensure there's at least one suggestion
            corrected_word, distance = top_words[0]  # Top suggestion
            corrected_words.append(corrected_word)  # Add corrected word
            distances.append(distance)  # Add the distance
        else:
            # If no suggestions, append the original word
            corrected_words.append(word)
            distances.append(None)  # No distance available

        print_suggestion(word, top_words)

    # Combine corrected words into a single sentence
    corrected_sentence = ' '.join(corrected_words)

    # Return values
    return sentence, corrected_sentence, distances

def print_suggestion(word, top_words):
    print(f"Suggestions for '{word}':")
    for i, (correct_word, distance) in enumerate(top_words, 1):
        print(f"{i}. {correct_word} (Distance: {distance})")

In [None]:
# Load the dictionary from the text file
sinhala_dictionary = load_dictionary('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict2.csv')

# Input sentence
sentence = "මම පොත් ලයනවා"

# Perform spell check
sentence, corrected_sentence, distances = check_sentence(sentence, sinhala_dictionary)
print(f"\nInput Sentence: {sentence}")
print(f"Suggested Correction: {corrected_sentence}")
print(f"Levenshtein Distances: {distances}")

Suggestions for 'මම':
1. මම (Distance: 0)
2. ඇම (Distance: 1)
3. ඉම (Distance: 1)
Suggestions for 'පොත්':
1. පහත් (Distance: 1)
2. පොතන (Distance: 1)
3. පොත්ත (Distance: 1)
Suggestions for 'ලයනවා':
1. උයනවා (Distance: 1)
2. ගයනවා (Distance: 1)
3. යනවා (Distance: 1)

Input Sentence: මම පොත් ලයනවා
Suggested Correction: මම පහත් උයනවා
Levenshtein Distances: [0, 1, 1]
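
In the output above, several candidates tie at distance 1 and the first one in dictionary order becomes the correction. A minimal sketch of two possible safeguards follows: leave a word unchanged when it is already in the dictionary, and ignore candidates beyond a distance cap. The names `suggest`, `correct_word`, `max_distance`, and the toy dictionary are all hypothetical, and a plain-Python distance stands in for `Levenshtein.distance` so the block runs on its own.

```python
import heapq


def levenshtein(a, b):
    """Plain-Python edit distance (stand-in for Levenshtein.distance)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def suggest(word, dictionary, top_n=3, max_distance=2):
    """Return up to top_n (candidate, distance) pairs within max_distance."""
    scored = ((candidate, levenshtein(word, candidate)) for candidate in dictionary)
    close = (pair for pair in scored if pair[1] <= max_distance)
    # nsmallest keeps only the best top_n instead of sorting every entry.
    return heapq.nsmallest(top_n, close, key=lambda pair: pair[1])


def correct_word(word, dictionary, known_words):
    """Keep known words; otherwise take the closest candidate, if any."""
    if word in known_words:
        return word
    suggestions = suggest(word, dictionary)
    return suggestions[0][0] if suggestions else word


# Hypothetical toy dictionary, not the real dataset.
toy_dictionary = ["මම", "පොත්", "පහත්", "ලියනවා", "උයනවා"]
known = set(toy_dictionary)

print(correct_word("පොත්", toy_dictionary, known))
print(suggest("ලයනවා", toy_dictionary))
```

Ties remain possible with this approach. Breaking them well would need extra information such as word frequency or context, which is outside this sketch.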


# Sinhala to International Phonetic Alphabet (IPA) and back

In [69]:
def sinhala_to_ipa(text):
    consonant_map = {
        "ක": "k", "ඛ": "kʰ", "ග": "ɡ", "ඝ": "ɡʱ",
        "ඞ": "ŋ", "ඟ": "ŋɡ", "ච": "ʧ", "ඡ": "ʧʰ",
        "ජ": "ʤ", "ඣ": "ʤʱ", "ඤ": "ɲ", "ඥ": "ɡn",
        "ට": "ʈ", "ඨ": "ʈʰ", "ඩ": "ɖ", "ඪ": "ɖʱ",
        "ණ": "ɳ", "ත": "t̪", "ථ": "t̪ʰ", "ද": "d̪",
        "ධ": "d̪ʱ", "න": "n̪", "ප": "p", "ඵ": "pʰ",
        "බ": "b", "භ": "bʱ", "ම": "m", "ය": "j",
        "ර": "r", "ල": "l", "ව": "ʋ", "ශ": "ʃ",
        "ෂ": "ʂ", "ස": "s", "හ": "h", "ළ": "ɭ",
        "ෆ": "f"
    }

    vowel_map = {
        "අ": "ʌ", "ආ": "aː", "ඇ": "æ", "ඈ": "æː",
        "ඉ": "i", "ඊ": "iː", "උ": "u", "ඌ": "uː",
        "එ": "e", "ඒ": "eː", "ඔ": "o", "ඕ": "oː",
        "ා": "aː", "ැ": "æ", "ෑ": "æː", "ි": "i",
        "ී": "iː", "ු": "u", "ූ": "uː", "ෙ": "e",
        "ේ": "eː", "ො": "o", "ෝ": "oː", "ෞ": "au"
    }

    def get_next_chars(pos, text, count=3):
        result = []
        for i in range(count):
            if pos + i < len(text):
                result.append(text[pos + i])
            else:
                result.append(None)
        return result

    def process_syllable(pos, text):
        char = text[pos]
        next_chars = get_next_chars(pos + 1, text)

        if char == " ":
            return "| ", 1

        if char == "අ" and next_chars[0] == "ං":
            return "ʌŋ ", 2

        if char == "ං":
            return "", 1

        if char in consonant_map:
            if char == "න" and pos + 2 < len(text) and text[pos:pos+3] == "නවා":
                return "n̪ ə ", 1

            if char == "ව" and pos + 1 < len(text) and text[pos:pos+2] == "වා":
                return "ʋ a ", 2

            base = consonant_map[char] + " "

            if next_chars[0] in vowel_map:
                return base, 1
            elif next_chars[0] == "්":
                return base, 2
            elif pos == len(text) - 1:
                return base + "ə ", 1
            else:
                return base + "ʌ ", 1

        elif char in vowel_map:
            if char == "ා" and pos == len(text) - 1:
                return "", 1
            return vowel_map[char] + " ", 1

        elif char == "්":
            return "", 1

        return char + " ", 1

    result = ""
    i = 0
    while i < len(text):
        segment, skip = process_syllable(i, text)
        result += segment
        i += skip

    result = result.strip()
    while "  " in result:
        result = result.replace("  ", " ")

    return result

def ipa_to_sinhala(ipa_text):
    ipa_map = {
        # Base vowels
        "ʌ": "අ", "aː": "ආ", "æ": "ඇ", "æː": "ඈ",
        "i": "ඉ", "iː": "ඊ", "u": "උ", "uː": "ඌ",
        "e": "එ", "eː": "ඒ", "o": "ඔ", "oː": "ඕ",

        # Consonants
        "k": "ක", "kʰ": "ඛ", "ɡ": "ග", "ɡʱ": "ඝ",
        "ŋ": "ං", "ŋɡ": "ඟ", "ʧ": "ච", "ʧʰ": "ඡ",
        "ʤ": "ජ", "ʤʱ": "ඣ", "ɲ": "ඤ", "ʈ": "ට",
        "ʈʰ": "ඨ", "ɖ": "ඩ", "ɖʱ": "ඪ", "ɳ": "ණ",
        "t̪": "ත", "t̪ʰ": "ථ", "d̪": "ද", "d̪ʱ": "ධ",
        "n̪": "න", "p": "ප", "pʰ": "ඵ", "b": "බ",
        "bʱ": "භ", "m": "ම", "j": "ය", "r": "ර",
        "l": "ල", "ʋ": "ව", "s": "ස", "h": "හ",
        "ʃ": "ශ", "ʂ": "ෂ", "ɭ": "ළ", "f": "ෆ"
    }

    # Split IPA text into words using the vertical bar or multiple spaces as delimiter
    words = [word.strip() for word in ipa_text.replace("|", " ").split("  ")]
    result = []

    for word in words:
        tokens = word.split()
        word_result = ""
        i = 0

        while i < len(tokens):
            token = tokens[i]
            remaining_tokens = tokens[i:]

            if token == "ʌŋ" or (token == "ʌ" and i + 1 < len(tokens) and tokens[i+1] == "ŋ"):
                word_result += "අං"
                i += 2 if token == "ʌ" else 1
                continue

            if len(remaining_tokens) >= 4 and remaining_tokens[0:4] == ["n̪", "ə", "ʋ", "a"]:
                word_result += "නවා"
                i += 4
                continue

            if token in ipa_map:
                word_result += ipa_map[token]

                next_token = tokens[i + 1] if i + 1 < len(tokens) else None

                if next_token == "ə":
                    if i + 2 < len(tokens) and tokens[i+2] != "j":
                        word_result += "්"
                    i += 2
                elif next_token == "ʌ":
                    i += 2
                elif next_token == "a":
                    if i + 1 == len(tokens) - 1:
                        word_result += "ා"
                    i += 2
                else:
                    if token not in "ʌaːæiːuːeːoː":
                        next_is_consonant = (next_token in ipa_map and
                                          next_token not in "ʌaːæiːuːeːoː")
                        if next_is_consonant and next_token != "j":
                            word_result += "්"
                    i += 1
            else:
                i += 1

        result.append(word_result)

    return " ".join(result)

# Test cases
test_cases = [
    ("අංකනය", "ʌŋ k ʌ n̪ ə j ə"),
    ("කරනවා", "k ʌ r ʌ n̪ ə ʋ a"),
    ("ගහනවා", "ɡ ʌ h ʌ n̪ ə ʋ a"),
    ("මට", "m ʌ ʈ ə"),
    ("බලන්න", "b ʌ l ʌ n̪ n̪ ə"),
    ("බලන්න අහස", "b ʌ l ʌ n̪ n̪ ʌ | ʌ h ʌ s ə")
]

print("Testing Sinhala to IPA conversion:")
for sinhala, expected_ipa in test_cases:
    result = sinhala_to_ipa(sinhala)
    print(f"Sinhala: {sinhala}")
    print(f"Expected IPA: {expected_ipa}")
    print(f"Got IPA: {result}")
    print()

print("Testing IPA to Sinhala conversion:")
for sinhala, ipa in test_cases:
    result = ipa_to_sinhala(ipa)
    print(f"IPA: {ipa}")
    print(f"Expected Sinhala: {sinhala}")
    print(f"Got Sinhala: {result}")
    print()

Testing Sinhala to IPA conversion:
Sinhala: අංකනය
Expected IPA: ʌŋ k ʌ n̪ ə j ə
Got IPA: ʌŋ k ʌ n̪ ʌ j ə

Sinhala: කරනවා
Expected IPA: k ʌ r ʌ n̪ ə ʋ a
Got IPA: k ʌ r ʌ n̪ ə ʋ a

Sinhala: ගහනවා
Expected IPA: ɡ ʌ h ʌ n̪ ə ʋ a
Got IPA: ɡ ʌ h ʌ n̪ ə ʋ a

Sinhala: මට
Expected IPA: m ʌ ʈ ə
Got IPA: m ʌ ʈ ə

Sinhala: බලන්න
Expected IPA: b ʌ l ʌ n̪ n̪ ə
Got IPA: b ʌ l ʌ n̪ n̪ ə

Sinhala: බලන්න අහස
Expected IPA: b ʌ l ʌ n̪ n̪ ʌ | ʌ h ʌ s ə
Got IPA: b ʌ l ʌ n̪ n̪ ʌ | ʌ h ʌ s ə

Testing IPA to Sinhala conversion:
IPA: ʌŋ k ʌ n̪ ə j ə
Expected Sinhala: අංකනය
Got Sinhala: අංකනය

IPA: k ʌ r ʌ n̪ ə ʋ a
Expected Sinhala: කරනවා
Got Sinhala: කරනවා

IPA: ɡ ʌ h ʌ n̪ ə ʋ a
Expected Sinhala: ගහනවා
Got Sinhala: ගහනවා

IPA: m ʌ ʈ ə
Expected Sinhala: මට
Got Sinhala: මට

IPA: b ʌ l ʌ n̪ n̪ ə
Expected Sinhala: බලන්න
Got Sinhala: බලන්න

IPA: b ʌ l ʌ n̪ n̪ ʌ | ʌ h ʌ s ə
Expected Sinhala: බලන්න අහස
Got Sinhala: බලන්න අහස
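
In the output above, the first Sinhala-to-IPA case prints `n̪ ʌ` where `n̪ ə` is expected, which is easy to miss when reading printed pairs. A small helper that collects only the failing cases makes this visible. The sketch below assumes a hypothetical `find_mismatches` function and uses a stand-in converter that replays two of the printed results; in the notebook one would pass `sinhala_to_ipa` or `ipa_to_sinhala` instead.

```python
def find_mismatches(cases, convert):
    """Return (source, expected, got) for each case where convert(source) != expected."""
    mismatches = []
    for source, expected in cases:
        got = convert(source)
        if got != expected:
            mismatches.append((source, expected, got))
    return mismatches


# Stand-in converter: replays two outputs printed by the test cell above.
observed = {
    "අංකනය": "ʌŋ k ʌ n̪ ʌ j ə",
    "මට": "m ʌ ʈ ə",
}
cases = [
    ("අංකනය", "ʌŋ k ʌ n̪ ə j ə"),
    ("මට", "m ʌ ʈ ə"),
]

for source, expected, got in find_mismatches(cases, observed.get):
    print(f"MISMATCH {source}: expected {expected!r}, got {got!r}")
```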



In [70]:
# Example Usage
sinhala_word = "බලන්න අහස"  # Input Sinhala text (may contain several words)
ipa_transcription = sinhala_to_ipa(sinhala_word)
print("Sinhala Word:", sinhala_word)
print("IPA Transcription:", ipa_transcription)

Sinhala Word: බලන්න අහස
IPA Transcription: b ʌ l ʌ n̪ n̪ ʌ | ʌ h ʌ s ə


In [71]:
import pandas as pd

# Load the dataset without headers (it doesn't have column names)
file_path = '/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict2.csv'
df = pd.read_csv(file_path, header=None, names=['word'])

# Apply the `sinhala_to_ipa` function to each word
df['IPA'] = df['word'].apply(sinhala_to_ipa)

# Save the new dataset with 'word' and 'IPA' columns to a new CSV file
output_path = '/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict_with_ipa.csv'
df[['word', 'IPA']].to_csv(output_path, index=False)

print("New dataset with IPA has been saved to:", output_path)


New dataset with IPA has been saved to: /content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict_with_ipa.csv


In [72]:
df

Unnamed: 0,word,IPA
0,අභිචෝදකයා,ʌ bʱ i ʧ oː d̪ ʌ k ʌ j
1,අංකනය,ʌŋ k ʌ n̪ ʌ j ə
2,අංකන,ʌŋ k ʌ n̪ ə
3,අංකය,ʌŋ k ʌ j ə
4,අංකාන්තරය,ʌŋ k aː n̪ t̪ ʌ r ʌ j ə
...,...,...
67255,හැරීය,h æ r iː j ə
67256,හැරීයන්නේ,h æ r iː j ʌ n̪ n̪ eː
67257,හැරීයයි,h æ r iː j ʌ j i
67258,හැරීයාම,h æ r iː j aː m ə


In [78]:
import Levenshtein

def spell_check(word, dictionary, top_n=3):
    # List to store words with their distances
    word_distances = []

    for correct_word in dictionary:
        # Calculate the Levenshtein distance between the word and dictionary word
        distance = Levenshtein.distance(word, correct_word)
        word_distances.append((correct_word, distance))

    # Sort the list by distance (ascending order)
    word_distances.sort(key=lambda x: x[1])

    # Return the top N closest words
    return word_distances[:top_n]

def check_sentence(sentence, sinhala_dictionary):
    words = sentence.split()  # Split the input sentence into words
    corrected_words = []  # List to store corrected words
    distances = []  # List to store Levenshtein distances for each word

    for word in words:
        # Convert the word to IPA, then get the closest IPA entries and their distances
        top_words = spell_check(sinhala_to_ipa(word), sinhala_dictionary, top_n=3)
        if top_words:  # Ensure there's at least one suggestion
            corrected_word, distance = top_words[0]  # Top suggestion
            corrected_words.append(ipa_to_sinhala(corrected_word))  # Add corrected word
            distances.append(distance)  # Add the distance
        else:
            # If no suggestions, append the original word
            corrected_words.append(word)
            distances.append(None)  # No distance available

        print_suggestion(word, top_words)

    # Combine corrected words into a single sentence
    corrected_sentence = ' '.join(corrected_words)

    # Return values
    return sentence, corrected_sentence, distances

def print_suggestion(word, top_words):
    top_words = [(ipa_to_sinhala(correct_word), distance) for correct_word, distance in top_words]
    print(f"Suggestions for '{word}':")
    for i, (correct_word, distance) in enumerate(top_words, 1):
        print(f"{i}. {correct_word} (Distance: {distance})")

In [74]:
import pandas as pd

def load_dictionary(file_path):
    # Read the CSV with 'word' and 'IPA' columns into a DataFrame
    df = pd.read_csv(file_path)
    return df

In [82]:
# Load the word/IPA dictionary from the CSV file
sinhala_dictionary = load_dictionary('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/Datasets/sinhala_dict_with_ipa.csv')

# Input sentence
sentence = "මම ආහස බලනව"

# Perform spell check
sentence, corrected_sentence, distances = check_sentence(sentence, sinhala_dictionary['IPA'])
print(f"\nInput Sentence: {sentence}")
print(f"Suggested Correction: {corrected_sentence}")
print(f"Levenshtein Distances: {distances}")

Suggestions for 'මම':
1. මම (Distance: 0)
2. ගම (Distance: 1)
3. මඩ (Distance: 1)
Suggestions for 'ආහස':
1. අහස (Distance: 2)
2. ආගම (Distance: 2)
3. ආපසඋ (Distance: 2)
Suggestions for 'බලනව':
1. කලනය (Distance: 2)
2. බමනය (Distance: 2)
3. බලතල (Distance: 2)

Input Sentence: මම ආහස බලනව
Suggested Correction: මම අහස කලනය
Levenshtein Distances: [0, 2, 2]
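
Two things appear to affect the result above. The distance is computed character by character on the IPA strings, so a phoneme written with several code points (such as `aː` or `n̪`) counts as more than one symbol. The suggestions are also converted back with `ipa_to_sinhala`, which can produce spellings that are not in the dictionary (the third suggestion for 'ආහස' looks like such a case). A minimal sketch of one alternative follows: compare lists of phoneme tokens, and return the spelling stored next to each IPA entry instead of converting back. The names `closest_words` and `ipa_to_words` and the three sample rows are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance over any two sequences (strings or lists of tokens)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def closest_words(ipa, ipa_to_words, top_n=3):
    """Rank entries by phoneme-level distance and return their stored spellings."""
    query = ipa.split()
    scored = []
    for candidate_ipa, spellings in ipa_to_words.items():
        distance = levenshtein(query, candidate_ipa.split())
        for spelling in spellings:
            scored.append((spelling, distance))
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]


# Hypothetical rows standing in for the 'word' and 'IPA' columns of the CSV.
rows = [("අහස", "ʌ h ʌ s ə"), ("ආගම", "aː ɡ ʌ m ə"), ("මම", "m ʌ m ə")]
ipa_to_words = {}
for word, ipa in rows:
    ipa_to_words.setdefault(ipa, []).append(word)

# IPA of the misspelled 'ආහස' as the converter above would write it.
print(closest_words("aː h ʌ s ə", ipa_to_words, top_n=1))  # [('අහස', 1)]
```

With tokens, 'ආහස' and 'අහස' differ by one phoneme, whereas the character-level comparison reports a distance of 2 for the same pair.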
