Statistical Language Modeling Using N-grams

Course Project for MA 202 [Probability and Statistics]


Abstract

Using Natural Language Processing techniques, the model predicts the most probable next word and scores the correctness of an input English sentence. To achieve optimum accuracy, a large, reliable corpus is extracted from Wikipedia, preprocessed, and analyzed before being used to train the model. Analyzing and visualizing the dataset is an insightful way to understand the corpus before training. Choosing an appropriate model for a problem is a crucial step; in our case, a trigram model proved to be the best trade-off between accuracy and complexity. The trained model is then used to predict the next word and to compute the perplexity of a given sentence.
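The core idea above, counting trigrams to predict the next word and to compute sentence perplexity, can be sketched in plain Python as follows. This is a minimal illustration, not the repository's actual code: the toy corpus, the add-one (Laplace) smoothing, and all function names here are assumptions for the example; the real project trains on a Wikipedia corpus.

```python
from collections import Counter
import math

# Toy corpus standing in for the preprocessed Wikipedia text (hypothetical data).
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

def tokens(sent):
    # Pad each sentence so trigrams cover sentence boundaries.
    return ["<s>", "<s>"] + sent.split() + ["</s>"]

trigram_counts = Counter()
bigram_counts = Counter()
vocab = set()
for sent in corpus:
    toks = tokens(sent)
    vocab.update(toks)
    for i in range(len(toks) - 2):
        trigram_counts[tuple(toks[i:i + 3])] += 1
        bigram_counts[tuple(toks[i:i + 2])] += 1

V = len(vocab)

def prob(w1, w2, w3):
    # P(w3 | w1, w2) with add-one smoothing to avoid zero probabilities.
    return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + V)

def predict_next(w1, w2):
    # Most probable word following the two-word context (w1, w2).
    return max(vocab, key=lambda w: prob(w1, w2, w))

def perplexity(sentence):
    # exp of the average negative log-probability over the sentence's trigrams;
    # lower perplexity means the model finds the sentence more plausible.
    toks = tokens(sentence)
    log_p = sum(math.log(prob(*toks[i:i + 3])) for i in range(len(toks) - 2))
    return math.exp(-log_p / (len(toks) - 2))

print(predict_next("the", "cat"))
print(perplexity("the cat sat on the mat"))
```

On this toy data, a sentence seen during training receives a lower perplexity than a scrambled one, which is exactly the signal the project uses to judge the correctness of an input sentence.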

Problem Statement

Computers were once thought of as “dumb terminals,” and human-computer interaction followed the principle of “garbage in, garbage out”: computers could communicate only through sophisticated hand-coded rules. Natural Language Processing bridges this gap by enabling humans to interact with computers in human languages. Its use cases include voice assistants, speech recognition, computer-assisted coding, and word and sentence prediction. The boundless, still-unexplored possibilities of NLP motivate us to work in this field.

Interface

[Screenshot of the interface]

Link to the Interface

Requirements

nltk

License

The code is licensed under the MIT License and is free for anyone to use without restriction.


Created with ❤️ by Mumuksh Tayal
