**EC9640 - Artificial Intelligence Project** (2020/E/031, 2020/E/076)

Grammar checker for Tamil

In [2]:
!pip install stanza
import stanza
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression



In [19]:
# Load Tamil language model for Stanza
stanza.download('ta')
nlp = stanza.Pipeline('ta')

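# Load the dataset (the path assumes Google Drive is mounted in Colab).
# There is no error handling here: read_csv raises FileNotFoundError if the file is missing.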
dataset_path = '/content/drive/MyDrive/tamil_grammar_dataset2.csv'
data = pd.read_csv(dataset_path)

# Print original columns for inspection
print("Original Dataset Columns:", data.columns)


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: ta (Tamil) ...
INFO:stanza:File exists: /root/stanza_resources/ta/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Loading these models for language: ta (Tamil):
| Processor | Package      |
----------------------------
| tokenize  | ttb          |
| mwt       | ttb          |
| pos       | ttb_nocharlm |
| lemma     | ttb_nocharlm |
| depparse  | ttb_nocharlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


Original Dataset Columns: Index(['Error Sentence', 'Corrected Sentence'], dtype='object')


In [20]:
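# Preview the first five rows. data.shape is evaluated but not displayed,
# because a notebook cell only echoes its last expression.
data.shape
data.head()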

,Error Sentence,Corrected Sentence
0,நூலகம் அவர்கள் சென்றாய்,அவர்கள் நூலகம் சென்றார்கள்
1,நூலகம் நாம் சென்றேன்,நாம் நூலகம் சென்றோம்
2,அவர்கள் சென்றோம் பள்ளிக்கூடம்,அவர்கள் பள்ளிக்கூடம் சென்றார்கள்
3,நூலகம் நான் சென்றார்,நான் நூலகம் சென்றேன்
4,நீ சென்றோம் நூலகம்,நீ நூலகம் சென்றாய்


In [13]:
data.columns = data.columns.str.strip()
print("Dataset Columns:", data.columns)
if 'Error Sentence' not in data.columns or 'Corrected Sentence' not in data.columns:
    raise ValueError("Dataset must contain 'Error Sentence' and 'Corrected Sentence' columns.")


Dataset Columns: Index(['Error Sentence', 'Corrected Sentence'], dtype='object')
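
The loading cell reads the CSV without any guard, and the leading space in the suggested output further down hints that some cell values carry stray whitespace. A minimal sketch of a loader that covers both cases, assuming the same two column names as above. The helper name `load_grammar_dataset` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

REQUIRED_COLUMNS = ("Error Sentence", "Corrected Sentence")


def load_grammar_dataset(path):
    """Load the sentence-pair CSV and normalise whitespace (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")

    data = pd.read_csv(path)
    # Column headers may carry stray spaces
    data.columns = data.columns.str.strip()

    missing = [c for c in REQUIRED_COLUMNS if c not in data.columns]
    if missing:
        raise ValueError(f"Dataset is missing required columns: {missing}")

    # Drop incomplete pairs, then trim whitespace inside the cells
    data = data.dropna(subset=list(REQUIRED_COLUMNS)).copy()
    for column in REQUIRED_COLUMNS:
        data[column] = data[column].astype(str).str.strip()
    return data.reset_index(drop=True)
```

In the notebook this would be called as `data = load_grammar_dataset(dataset_path)`.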


In [21]:
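# Inputs are the erroneous sentences; targets are their corrected forms
X = data['Error Sentence']
y = data['Corrected Sentence']

# Convert the sentences to unigram and bigram count features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_vectorized = vectorizer.fit_transform(X)

# Hold out 20% of the pairs, then train a logistic regression classifier.
# Each distinct corrected sentence acts as one class label, so the model can
# only suggest corrections that appear in the training data.
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)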

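# Predict a correction for one sentence with the trained classifier.
# The error label is a fixed string: every input is reported as a
# subject-verb agreement error.
def predict_errors_ml(sentence):
    processed_sentence = nlp(sentence)  # Stanza parse; computed but not used yet
    sentence_vector = vectorizer.transform([sentence])
    prediction = model.predict(sentence_vector)[0]
    return "Subject-Verb Agreement Errors", prediction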

# Process a paragraph sentence by sentence and print the corrected version
def process_paragraph_ml(paragraph):
    # Sentences are assumed to be separated by a full stop followed by a space
    sentences = paragraph.split('. ')
    corrected_paragraph = []
    for idx, sentence in enumerate(sentences, start=1):
        if sentence.strip():
            print(f"Sentence {idx}: {sentence.strip()}")
            result, suggestion = predict_errors_ml(sentence.strip())
            print(f"Error: {result}")
            print(f"Suggested output: {suggestion}")
            corrected_paragraph.append(suggestion)
    # Join the suggested sentences back into one paragraph
    print("\nFinal Corrected Paragraph:")
    print(". ".join(corrected_paragraph) + ".")
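
Two details of the training cell may be worth a second look. First, `CountVectorizer`'s default `token_pattern` relies on `\w`, which in Python's `re` module does not match Tamil vowel signs or the pulli (they are combining marks), so words can be cut into fragments. Second, the vectorizer is fitted on all sentences before the split, and `X_test` and `y_test` are never used. The sketch below assumes that whitespace-separated words are acceptable tokens, wraps the vectorizer and classifier in a pipeline so that both are fitted on the training split only, and scores the held-out part. The toy pairs are generated in the style of the preview rows and are not taken from the real dataset; `build_model` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

sample = "நூலகம் அவர்கள் சென்றாய்"

# Default analyzer: \w skips Tamil combining marks, so words are fragmented
default_tokens = CountVectorizer().build_analyzer()(sample)
# Whitespace-based pattern (an assumption, not the notebook's setting)
whitespace_tokens = CountVectorizer(token_pattern=r"\S+").build_analyzer()(sample)
print("default:   ", default_tokens)
print("whitespace:", whitespace_tokens)


def build_model():
    """Vectorizer and classifier in one pipeline (hypothetical helper)."""
    return make_pipeline(
        CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )


# Synthetic toy pairs: each subject with a wrong verb form, in two word orders
subject_verbs = {"அவர்கள்": "சென்றார்கள்", "நாம்": "சென்றோம்", "நான்": "சென்றேன்"}
errors, corrections = [], []
for subject, correct_verb in subject_verbs.items():
    for wrong_verb in subject_verbs.values():
        if wrong_verb == correct_verb:
            continue
        for error in (f"நூலகம் {subject} {wrong_verb}", f"{subject} {wrong_verb} நூலகம்"):
            errors.append(error)
            corrections.append(f"{subject} நூலகம் {correct_verb}")

# Split the raw sentences first, so the vocabulary comes from training data only
X_train, X_test, y_train, y_test = train_test_split(
    errors, corrections, test_size=0.25, random_state=42
)
model = build_model()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on toy data: {accuracy:.2f}")
```

On the real dataset, `cross_val_score`, which the notebook already imports, could be applied to `build_model()` and the raw sentence columns for a more stable estimate than a single split.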



In [22]:
# User interface for paragraphs
paragraph = input("Enter the paragraph: ")
process_paragraph_ml(paragraph)


Enter the paragraph: புத்தகம் அவர்கள் படித்தாய்
Sentence 1: புத்தகம் அவர்கள் படித்தாய்
Error: Subject-Verb Agreement Errors
Suggested output:  அவர்கள் புத்தகம் படித்தார்கள்

Final Corrected Paragraph:
 அவர்கள் புத்தகம் படித்தார்கள்.
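
The error label printed above is a fixed string: `predict_errors_ml` returns "Subject-Verb Agreement Errors" for every sentence, and the Stanza parse it computes is not used. One way the parse could inform the label is to compare the Person and Number features of the subject with those of the word it depends on. The sketch below reads the `words`, `deprel`, `head`, and `feats` attributes that Stanza exposes on parsed sentences. Whether the Tamil model annotates Person and Number on these particular words is an assumption, and `parse_feats`, `has_agreement_error`, and `fake_doc` are hypothetical helpers. The example runs on stand-in objects with illustrative feature values rather than real Stanza output, so it needs no model download.

```python
from types import SimpleNamespace


def parse_feats(feats):
    """Turn a UD feature string like 'Number=Sing|Person=1' into a dict."""
    if not feats:
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def has_agreement_error(doc, keys=("Person", "Number")):
    """Return True if any subject disagrees with its head on the given features."""
    for sentence in doc.sentences:
        words = sentence.words
        for word in words:
            if not word.deprel.startswith("nsubj") or word.head == 0:
                continue
            head = words[word.head - 1]  # head indices are 1-based; 0 marks the root
            subject_feats = parse_feats(word.feats)
            head_feats = parse_feats(head.feats)
            for key in keys:
                # Only compare features that both words actually carry
                if key in subject_feats and key in head_feats:
                    if subject_feats[key] != head_feats[key]:
                        return True
    return False


def fake_doc(rows):
    """Build a stand-in for a parsed document from (text, deprel, head, feats) rows."""
    words = [SimpleNamespace(text=t, deprel=d, head=h, feats=f) for t, d, h, f in rows]
    return SimpleNamespace(sentences=[SimpleNamespace(words=words)])


# Illustrative feature values, not real parser output
bad = fake_doc([
    ("நான்", "nsubj", 3, "Number=Sing|Person=1"),
    ("நூலகம்", "obj", 3, None),
    ("சென்றார்", "root", 0, "Number=Sing|Person=3"),
])
good = fake_doc([
    ("நான்", "nsubj", 3, "Number=Sing|Person=1"),
    ("நூலகம்", "obj", 3, None),
    ("சென்றேன்", "root", 0, "Number=Sing|Person=1"),
])
print(has_agreement_error(bad))   # True: Person=1 on the subject, Person=3 on the verb
print(has_agreement_error(good))  # False: Person and Number match
```

With the real pipeline, the call would be `has_agreement_error(nlp(sentence))`, and the result could decide whether a sentence is sent to the classifier at all.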
