*De La Salle University – Dasmariñas* \
*College of Information and Computer Studies*

**S–CSIS312LA: Natural Language Processing (Laboratory)**

**Name:** Luis Anton P. Imperial \
**Program–Year–Section:** BCS32 \
**Date:** Friday, December 13, 2024

## Problem:

Create a Python function that performs the following:

1. **Tokenization:** Tokenize a given input text, ignoring punctuation and handling case insensitivity.
2. **Sliding Window N-gram Generation:** Generate N-grams using a sliding window technique. The N-gram size (unigram, bigram, trigram, etc.) will be provided as an input.
3. **Frequency Count:** For each N-gram generated, count its frequency in the text.
4. **Filter N-grams by Frequency:** The function should return only those N-grams whose frequency is at or above a given threshold (`min_count`).
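
Step 1 can be sketched as follows, assuming that "ignoring punctuation" means keeping only runs of word characters. The helper name `tokenize` is hypothetical and is not part of the assignment. Note that this pattern treats an apostrophe as a separator, so a contraction such as "I'm" is split into two tokens.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of word characters (hypothetical helper)."""
    # \w+ matches letters, digits, and underscores; punctuation acts as a separator.
    return re.findall(r"\w+", text.lower())

print(tokenize("I love programming in Python!"))
# ['i', 'love', 'programming', 'in', 'python']
print(tokenize("Hi! I'm Luis."))
# ['hi', 'i', 'm', 'luis']
```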

## Requirements:

- **Tokenization:** You must split the input text into tokens (words), ignoring punctuation and converting all words to lowercase.
- **Sliding Window:** The N-grams should be generated using a sliding window approach. For example, if the text is "I love programming in Python", a bigram (2-gram) would generate:
  - 'I love'
  - 'love programming'
  - 'programming in'
  - 'in Python'
- **Frequency Count:** For each generated N-gram, count how many times it appears in the text.
- **Filter by Frequency:** The function should accept an additional parameter, `min_count`, which specifies the minimum frequency for an N-gram to be included in the result.
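
The sliding window requirement above can be illustrated with a short sketch. The helper name `sliding_ngrams` is hypothetical, and the token list is assumed to be already lowercased, so the bigrams come out in lowercase rather than in the mixed case shown in the example list. A text with `len(tokens)` tokens yields `len(tokens) - n + 1` windows, and none at all when `n` exceeds the number of tokens.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces (hypothetical helper)."""
    # The window starts at each index i that still leaves room for n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "love", "programming", "in", "python"]
print(sliding_ngrams(tokens, 2))
# ['i love', 'love programming', 'programming in', 'in python']
print(sliding_ngrams(tokens, 3))
# ['i love programming', 'love programming in', 'programming in python']
```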

## Input:

- **`text`** (str): A string of text, potentially with punctuation. The text can be in mixed case.
- **`N`** (int): The size of the N-grams (i.e., 1 for unigrams, 2 for bigrams, 3 for trigrams, etc.).
- **`min_count`** (int): The minimum frequency threshold. Only N-grams that appear `min_count` times or more should be included in the result.
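
The input description does not say what should happen for out-of-range values such as `N = 0`. One possible guard is sketched below. It assumes that invalid arguments should raise an error instead of producing an empty or odd result, and that `min_count` must be at least 1. The helper name `validate_inputs` is hypothetical.

```python
def validate_inputs(text, N, min_count):
    """Raise early on arguments the problem statement does not cover (hypothetical helper)."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    # Assumption: an N-gram needs at least one token.
    if not isinstance(N, int) or N < 1:
        raise ValueError("N must be a positive integer")
    # Assumption: a threshold below 1 is treated as a mistake.
    if not isinstance(min_count, int) or min_count < 1:
        raise ValueError("min_count must be a positive integer")

validate_inputs("I love Python", 2, 1)  # valid arguments pass silently

try:
    validate_inputs("I love Python", 0, 1)
except ValueError as err:
    print(err)  # N must be a positive integer
```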

## Output:

- A dictionary with two keys:
  - **`'ngrams'`**: A list of N-grams that appear at least `min_count` times.
  - **`'frequency'`**: A dictionary that maps each N-gram to its frequency count.

The N-grams should be returned in a list of strings, sorted by their frequency in descending order.
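
The output format can be sketched with a small, made-up list of bigrams. The statement leaves open whether `'frequency'` should hold every N-gram or only those that pass the threshold; this sketch assumes the latter. It relies on `Counter.most_common()`, which orders entries by count from highest to lowest and keeps first-seen order among equal counts.

```python
from collections import Counter

# Made-up bigrams for illustration only.
ngrams = ["in python", "i love", "in python", "is fun", "i love", "in python"]
counts = Counter(ngrams)
min_count = 2

# Keep only the N-grams that meet the threshold, already sorted by frequency.
kept = [(gram, c) for gram, c in counts.most_common() if c >= min_count]

result = {
    "ngrams": [gram for gram, _ in kept],
    "frequency": dict(kept),
}
print(result)
# {'ngrams': ['in python', 'i love'], 'frequency': {'in python': 3, 'i love': 2}}
```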

Submit the `.ipynb` file and screenshots of sample output.

```python
import re
from collections import Counter

def generate_ngrams(text, N, min_count):
    """
    Generate N-grams with a sliding window, count their frequencies, and filter by min_count.

    Args:
        text (str): Input text string.
        N (int): Size of the N-grams (e.g., 1 for unigrams, 2 for bigrams, etc.).
        min_count (int): Minimum frequency threshold to include an N-gram.

    Returns:
        dict: A dictionary containing:
            - 'ngrams': List of N-grams with frequency >= min_count.
            - 'frequency': Dictionary mapping each N-gram to its frequency.
    """
    # Tokenization: Convert text to lowercase and split into words ignoring punctuation
    tokens = re.findall(r'\b\w+\b', text.lower())

    # Sliding window: Generate N-grams
    ngrams = [" ".join(tokens[i:i+N]) for i in range(len(tokens) - N + 1)]

    # Count frequencies of each N-gram
    ngram_counts = Counter(ngrams)

    # Filter N-grams by minimum count threshold
    filtered_ngrams = {ngram: count for ngram, count in ngram_counts.items() if count >= min_count}

    # Sort N-grams by frequency in descending order
    sorted_ngrams = sorted(filtered_ngrams.items(), key=lambda x: x[1], reverse=True)

    # Prepare output
    result = {
        'ngrams': [ngram for ngram, _ in sorted_ngrams],
        'frequency': dict(sorted_ngrams)
    }

    return result

# Example usage
if __name__ == "__main__":
    texts = [
        "I love programming in Python. Programming in Python is fun!",
        "Natural language processing enables machines to understand human language.",
        "Hi! I'm Luis from La Salle Dasmariñas.",
        "Astro Bot didn't deserve to win The Game Awards.",
        "Merry Christmas and a happy New Year to everyone!"
    ]

    N = 2
    min_count = 1

    for i, text in enumerate(texts):
        print(f"\nText {i + 1}:")
        output = generate_ngrams(text, N, min_count)
        print("N-grams:", output['ngrams'])
        print("Frequency:", output['frequency'])
```

Text 1:
N-grams: ['programming in', 'in python', 'i love', 'love programming', 'python programming', 'python is', 'is fun']
Frequency: {'programming in': 2, 'in python': 2, 'i love': 1, 'love programming': 1, 'python programming': 1, 'python is': 1, 'is fun': 1}

Text 2:
N-grams: ['natural language', 'language processing', 'processing enables', 'enables machines', 'machines to', 'to understand', 'understand human', 'human language']
Frequency: {'natural language': 1, 'language processing': 1, 'processing enables': 1, 'enables machines': 1, 'machines to': 1, 'to understand': 1, 'understand human': 1, 'human language': 1}

Text 3:
N-grams: ['hi i', 'i m', 'm luis', 'luis from', 'from la', 'la salle', 'salle dasmariñas']
Frequency: {'hi i': 1, 'i m': 1, 'm luis': 1, 'luis from': 1, 'from la': 1, 'la salle': 1, 'salle dasmariñas': 1}

Text 4:
N-grams: ['astro bot', 'bot didn', 'didn t', 't deserve', 'deserve to', 'to win', 'win the', 'the game', 'game awards']
Frequency: {'astro bot': 1, 