<a href="https://colab.research.google.com/github/RajarajachozhanVK/RajarajachozhanVK/blob/main/N_Grams_for_Word_Document.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **N-Grams for Word Document**
1. Learning Objectives

    Implement n-grams in Python, both from scratch and using NLTK

    Understand n-grams and their importance

    Know the applications of n-grams in NLP

    https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/

2. Language Model

    Models that assign probabilities to sequences of words are called language models (LMs).
    The n-gram model is the simplest language model that assigns probabilities to sentences and sequences of words.
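
To make "assigning probabilities to sequences of words" concrete, here is a minimal sketch of a bigram model estimated from raw counts (maximum likelihood, no smoothing). The function names and the toy sentence are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(next | prev) as count(prev, next) / count(prev as a first word)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Every token except the last one starts exactly one bigram.
    context_counts = Counter(tokens[:-1])
    return {
        (prev, nxt): count / context_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }

def sentence_probability(tokens, probs):
    """Multiply bigram probabilities along the sequence; unseen bigrams give 0."""
    p = 1.0
    for pair in zip(tokens, tokens[1:]):
        p *= probs.get(pair, 0.0)
    return p

# Hypothetical toy corpus, already split into words.
tokens = "the cat sat on the mat".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "cat")])
print(sentence_probability("the cat sat".split(), probs))
```

Because unseen bigrams receive probability zero here, practical n-gram models usually add some form of smoothing; that is left out to keep the sketch short.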

3. What are N-grams?

N-grams are contiguous sequences of n words, symbols, or tokens extracted from text for language processing and analysis. An n-gram can be as short as a single word (unigram) or span several words (bigram, trigram, etc.). Because they keep neighboring words together, n-grams capture local context and relationships between words in a given text.

Examples:

    Unigrams (1-grams): single words, e.g., “please”, “turn”, “cat”, or “dog”
    Bigrams (2-grams): two-word sequences, e.g., “natural language” or “deep learning”
    Trigrams (3-grams): three-word sequences, e.g., “machine learning model” or “data science approach”
    4-grams, 5-grams, etc.: Sequences of four, five, or more consecutive words.
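
The definitions above can be sketched from scratch with a sliding window over a token list. The helper name `make_ngrams` and the example sentence are assumptions for illustration; whitespace splitting stands in for a real tokenizer.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace.
tokens = "please turn your homework in".split()
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
```

A text with fewer than n tokens simply yields an empty list, which avoids special-casing short inputs.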
  

4. Significance of N-grams in NLP

    Capturing Context and Semantics: N-grams keep neighboring words together, so they preserve local word order and context that isolated words lose.

    Improving Language Models: In language modeling, n-gram counts are used to estimate how likely a word is given the words before it, which supports applications such as machine translation and speech recognition.

    Enhancing Text Prediction: Predictive text applications use n-grams to predict the next word or sequence of words from the context provided by the preceding words.

    Information Retrieval: In information retrieval tasks, n-grams assist in matching and ranking documents based on how well their n-gram patterns match a query.

    Feature Extraction: N-grams serve as features in text classification and sentiment analysis, capturing patterns that help distinguish different classes or sentiments.
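
As one illustration of the text prediction point, the sketch below suggests the next word by looking up which word most often followed the current one in a training text. The function names and the toy sentence are assumptions for illustration; a real predictive keyboard would use far more data and longer contexts.

```python
from collections import Counter, defaultdict

def build_next_word_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def predict_next(table, word):
    """Return the most frequent follower of `word`, or None if it was never seen with one."""
    followers = table.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training text, split on whitespace.
tokens = ("i like deep learning and i like natural language "
          "processing and i enjoy nlp").split()
table = build_next_word_table(tokens)
print(predict_next(table, "i"))
print(predict_next(table, "deep"))
```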

5. Applications of N-grams in NLP

N-grams in NLP find applications across a wide range of domains, including:

    Sentiment analysis: Analyzing n-grams helps in understanding the sentiment expressed in text by capturing the context of words and phrases.
    Named Entity Recognition (NER): NER systems utilize n-grams to identify and classify named entities such as names, locations, organizations, dates, and more.
    Text classification: N-grams are used as features in machine learning models for classifying text into predefined categories.
    Topic modeling: N-grams aid in uncovering latent topics within a collection of documents, enabling clustering and categorization.
    Language generation: N-gram models can generate text by repeatedly choosing a next word based on the preceding words, a basic building block for applications such as chatbots or language translation systems.
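
For the sentiment analysis and text classification uses, a common pattern is to turn each document into a vector of n-gram counts that a classifier can consume. The sketch below builds such vectors by hand; the function name, the two example reviews, and the lowercase-and-split tokenization are assumptions for illustration.

```python
from collections import Counter

def ngram_features(text, n=2):
    """Count the word n-grams in a text (lowercased, whitespace tokenized)."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list yields each n-token window.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Two hypothetical reviews that differ only by a negation.
docs = ["the movie was not good", "the movie was good"]
features = [ngram_features(d) for d in docs]

# Shared vocabulary of bigrams, then one count vector per document.
vocab = sorted(set().union(*features))
vectors = [[f[g] for g in vocab] for f in features]

for gram in vocab:
    print(gram)
print(vectors)
```

The bigram `('not', 'good')` appears only in the first review, so the two vectors differ on a feature that signals negation, whereas a unigram representation would only register the extra word "not".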

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# An n-gram model is a probabilistic language model based on the frequency
# of n-grams (contiguous sequences of n items) in a given text.
from collections import defaultdict

import nltk  # word_tokenize relies on the 'punkt' data downloaded above
from nltk import ngrams
# Function to read text from a file
def read_text_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Function to generate n-grams; nltk's ngrams returns a lazy iterator of tuples
def generate_ngrams(tokens, n):
    return ngrams(tokens, n)

# Function to compute n-gram model
def compute_ngram_model(file_path, n):
    # Read text from file
    text = read_text_from_file(file_path)

    # Tokenize the text into words
    tokens = nltk.word_tokenize(text)

    # Generate n-grams
    ngrams_list = generate_ngrams(tokens, n)

    # Count the occurrences of each n-gram
    ngram_counts = defaultdict(int)
    for ngram in ngrams_list:
        ngram_counts[ngram] += 1

    # Display the n-gram counts
    for ngram, count in ngram_counts.items():
        print(f"{ngram}: {count} occurrences")

# Example usage with a file named 'ngram_task5.txt' and bigrams (n=2)
file_path = 'ngram_task5.txt'
n = 2
compute_ngram_model(file_path, n)

('N-grams', 'are'): 1 occurrences
('are', 'continuous'): 1 occurrences
('continuous', 'sequences'): 1 occurrences
('sequences', 'of'): 1 occurrences
('of', 'n'): 1 occurrences
('n', 'words'): 1 occurrences
('words', 'or'): 1 occurrences
('or', 'symbols'): 1 occurrences
('symbols', ','): 1 occurrences
(',', 'or'): 1 occurrences
('or', 'tokens'): 1 occurrences
('tokens', 'extracted'): 1 occurrences
('extracted', 'from'): 1 occurrences
('from', 'text'): 1 occurrences
('text', 'for'): 1 occurrences
('for', 'language'): 1 occurrences
('language', 'processing'): 1 occurrences
('processing', 'and'): 1 occurrences
('and', 'analysis'): 1 occurrences
('analysis', '.'): 1 occurrences
('.', 'An'): 1 occurrences
('An', 'n-gram'): 1 occurrences
('n-gram', 'can'): 1 occurrences
('can', 'be'): 1 occurrences
('be', 'as'): 1 occurrences
('as', 'short'): 1 occurrences
('short', 'as'): 1 occurrences
('as', 'a'): 1 occurrences
('a', 'single'): 1 occurrences
('single', 'word'): 1 occurrences
('word', '('): 
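
The cell above prints every n-gram in the order it first appears, which gets long quickly. A minimal sketch of an alternative that returns only the k most frequent n-grams is shown below. The function name `top_ngrams`, the regex tokenizer (a stand-in for `nltk.word_tokenize` so the snippet runs without NLTK data), and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def top_ngrams(text, n=2, k=3):
    """Return the k most frequent n-grams in `text` with their counts."""
    # Rough tokenizer: hyphenated words stay together, punctuation is split off.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Hypothetical sample text.
sample = "An n-gram can be as short as a single word or as long as multiple words."
print(top_ngrams(sample, n=1, k=1))
print(top_ngrams(sample, n=2, k=3))
```

Returning the counts instead of printing them inside the function also makes the result easier to reuse, for example to compare unigram and bigram frequencies for the same file.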