
# Exercise #3.1: Regular Expressions

## Introduction
In this hands-on exercise, you are tasked with enhancing a Python program that currently uses regular expressions to identify integers and real numbers. Your objective is to expand its capabilities to recognize a broader range of string patterns, including prices, email addresses, and Python identifiers.

## Program Behavior
By default, the program reads lines from standard input, attempts to match each line against a list of regular expressions, and outputs the name of the pattern that matches or "unknown" if there is no match.
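
The first-match behavior described above can be sketched as a small function. This is only an illustration built from the two starting patterns; the name `classify` is a hypothetical helper, not part of the provided program.

```python
import re

# The two patterns the exercise starts with. Order matters because the
# first pattern that matches wins.
PATTERNS = [
    (re.compile(r'^\d+$'), 'integer'),
    (re.compile(r'^\d+\.\d+$'), 'real number'),
]

def classify(line, patterns=PATTERNS):
    """Return the name of the first matching pattern, or 'unknown'."""
    for pattern, name in patterns:
        if pattern.match(line):
            return name
    return 'unknown'

for sample in ['42', '3.14', 'hello']:
    print(f"{sample}: {classify(sample)}")
```

Because the loop stops at the first match, a new entry that overlaps with an existing one is only reached if the earlier patterns fail.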

## Task Description
In the `main()` function of the program, there is the following list of tuples. Each tuple contains a regular expression and the name of the pattern it recognizes:

```python
patterns = [
    (r'^\d+$', 'integer'),
    (r'^\d+\.\d+$', 'real number'),
]
```

Your task is to add additional entries to this list to match the following types of strings:

### Price
Matches a price in Singapore dollars (SGD). Cents are optional, but if they are shown there must be exactly two digits. Commas may optionally separate groups of three digits (thousands, millions, and so on).

**Valid prices**:
- `$1`
- `$20`
- `$1.99`
- `$10.00`
- `$1500.50`
- `$2,000.99`
- `$1,234,567.89`

**Invalid prices**:
- `$1.9` (cents must have two digits if present)
- `$10,23.4` (improper comma placement)
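
The "optional, but exactly two digits when present" rule can be expressed with an optional group and a counted quantifier. The sketch below is a building block for plain numbers only, not a full price pattern: the dollar sign and the comma grouping are deliberately left out, and the name `two_decimals` is just for illustration.

```python
import re

# A number with an optional fractional part of exactly two digits.
# (\.\d{2})? makes the whole ".dd" group optional; \d{2} requires two digits.
two_decimals = re.compile(r'^\d+(\.\d{2})?$')

for sample in ['10', '10.00', '1.9', '1.999']:
    print(sample, bool(two_decimals.match(sample)))
```

Keep in mind that `$` is a regex metacharacter (end of string), so a literal dollar sign in a pattern has to be escaped as `\$`.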

### Email Address
Capturing all the rules for what makes a valid email address is complex, so we will use a simplified definition of a valid email address. This definition generally works just fine for extracting email addresses from documents.

The first part of the email address is the username portion, and it must not contain whitespace or the @ symbol. The username portion is followed by the @ symbol. After the @ symbol is the domain, which does not contain any whitespace or the @ symbol. The domain contains two or more non-empty components which are separated by periods. The final component must consist of only letters from the English alphabet.

**Valid email addresses**:
- `nsommer@smu.edu`
- `n.sommer@phdcs.smu.edu`
- `yippee_skippy@yee-haw.edu`
- `fun-times@Taylor.hall.smu.edu`

**Invalid email addresses**:
- `n@sommer@smu.edu` (multiple '@' symbols)
- `n sommer@smu.edu` (spaces not allowed)
- `nsommer@smu..edu` (consecutive periods not allowed)
- `nsommer@smu.edu-org` (hyphen in last domain extension)
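
One way to sanity-check a regular expression for these rules is to compare it against a plain-Python check that follows the definition step by step. The function below is a sketch of such a reference check using only string methods; `is_simple_email` is a hypothetical name, and it implements the simplified definition above, not the real email standard.

```python
def is_simple_email(text):
    """Check text against the simplified email rules using string methods."""
    # No whitespace anywhere, and exactly one '@'.
    if any(ch.isspace() for ch in text) or text.count('@') != 1:
        return False
    username, domain = text.split('@')
    if not username:
        return False
    components = domain.split('.')
    # Two or more components, none of them empty.
    if len(components) < 2 or not all(components):
        return False
    # The final component may only contain English letters.
    last = components[-1]
    return last.isascii() and last.isalpha()

valid_emails = ['nsommer@smu.edu', 'n.sommer@phdcs.smu.edu',
                'yippee_skippy@yee-haw.edu', 'fun-times@Taylor.hall.smu.edu']
invalid_emails = ['n@sommer@smu.edu', 'n sommer@smu.edu',
                  'nsommer@smu..edu', 'nsommer@smu.edu-org']

for address in valid_emails + invalid_emails:
    print(f"{address}: {is_simple_email(address)}")
```

A regular expression that agrees with this function on all the listed examples is a reasonable candidate for the `patterns` list.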

### Python Identifiers
A Python identifier is a name for a function, variable, etc. in a Python program. A Python identifier must contain only letters, digits, and underscores, and the first character must be a letter or an underscore.

**Valid Python identifiers**:
- `x`
- `x1y2`
- `_hello`
- `funName`
- `FunName`

**Invalid Python identifiers**:
- `1x` (cannot start with a digit)
- `bad name` (spaces are not allowed)
- `!name` (special characters other than underscore are not allowed)
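
As a quick cross-check of the examples above, Python's built-in `str.isidentifier()` can be used. Note that it is broader than the definition in this exercise: it also accepts non-ASCII letters and does not exclude reserved keywords such as `for`, so it is a sanity check rather than a replacement for the regular expression.

```python
valid = ['x', 'x1y2', '_hello', 'funName', 'FunName']
invalid = ['1x', 'bad name', '!name']

# str.isidentifier() applies Python's own (Unicode-aware) identifier rules.
for name in valid + invalid:
    print(f"{name!r}: {name.isidentifier()}")
```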

In [None]:
import re
import sys

def main():
    patterns = [
        (re.compile(r'^\d+$'), 'integer'),
        (re.compile(r'^\d+\.\d+$'), 'real number'),
        # TODO: you need to add entries for the following patterns:
        #   - a pattern verifying if the input is a valid price
        #   - a pattern verifying if the input is a valid email address
        #   - a pattern verifying if the input is a valid Python identifier
    ]



    print("Reading from standard input. Enter lines to match, or type 'quit' (or an empty line) to exit.")

    while True:
        try:
            input_line = input("Enter a string: ")
            if input_line.lower() in ("quit", ""):
                print("Exiting program.")
                break

            matched = False
            for pattern, name in patterns:
                if pattern.match(input_line):
                    print(f"{input_line}: {name}")
                    matched = True
                    break
            if not matched:
                print(f"{input_line}: unknown")

            sys.stdout.flush()

        except (EOFError, KeyboardInterrupt):
            # End of input (e.g. Ctrl-D or a closed pipe) or Ctrl-C: stop cleanly.
            print("Exiting program.")
            break

if __name__ == "__main__":
    main()


# Exercise #3.2: Building N-gram Language Models with NLTK

## Introduction

In this exercise, you are tasked with implementing N-gram language models using the NLTK library and the Reuters corpus. This exercise will help you understand how to process text, build N-gram models, and predict the next word in a sequence.


## Task Description

Below is the basic code structure for building N-gram language models with NLTK. Your tasks are to complete the following implementations:

### Task 1: Create N-Grams
- Extract N-grams from the list of words using NLTK. An N-gram is a tuple of N consecutive words from the corpus.
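
To make "consecutive words" concrete, here is a plain-Python sketch of what an N-gram extraction produces on a toy word list. The helper `make_ngrams` is hypothetical and only for illustration; in the exercise itself the N-grams should come from NLTK's `bigrams` and `trigrams`, which are expected to yield the same kind of tuples.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['the', 'stock', 'of', 'the', 'company']
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A list of `k` tokens gives `k - n + 1` N-grams, so the five tokens above produce four bigrams and three trigrams.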

### Task 2: Build the N-Gram Model

- Count Frequency: Build a frequency model by counting the occurrences of each N-gram in the corpus.
- Calculate Probabilities: Convert the frequency counts into conditional probabilities. For each context (the preceding N-1 words), divide the count of each N-gram by the total count of all N-grams that start with that context.
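
One common way to write this normalization, assuming plain relative-frequency estimates with no smoothing, is shown here for the bigram case, where $C(\cdot)$ denotes a count in the corpus:

$$
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{\sum_{w} C(w_{n-1}\, w)}
$$

As a made-up illustration: if "the" is followed by "stock" 3 times and by "news" 1 time in a toy corpus, then $P(\text{stock} \mid \text{the}) = 3 / (3 + 1) = 0.75$. The trigram case is analogous, with the two preceding words as the context.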

### Task 3: Implement Prediction Function
- Implement a function `predict_next_word()` that predicts the next word from the input words using your N-gram model.
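
Once the model holds a probability for each candidate next word, there are two common ways to turn it into a prediction: pick the most probable word, or sample in proportion to the probabilities (as the provided unigram code does with `random.choices`). The sketch below shows both on a hand-written distribution; the numbers and the helper names `most_likely` and `sample` are made up for illustration.

```python
import random

# Hypothetical conditional distribution P(next | "the"); the numbers are made up.
next_word_probs = {'stock': 0.5, 'news': 0.3, 'company': 0.2}

def most_likely(probs):
    """Greedy prediction: the word with the highest probability."""
    return max(probs, key=probs.get)

def sample(probs):
    """Random prediction: draw one word in proportion to its probability."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights)[0]

print(most_likely(next_word_probs))
print(sample(next_word_probs))
```

The greedy choice is deterministic, while sampling gives varied output across runs. Either way, a context that never appeared in the corpus has no distribution to draw from and needs separate handling.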


In [3]:
# Import necessary libraries
import nltk
from nltk import bigrams, trigrams
from nltk.corpus import reuters
from collections import defaultdict
import random


# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to
[nltk_data]     /home/jingguiliang/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/jingguiliang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))


def UniGramModel(words):

    """Task 1: Implement unigram using NLTK"""
    uni_grams = words

    """Task 2: Build a unigram model"""
    model = defaultdict(lambda: 0)

    """Task 3: Count frequency of each word and transform the counts into probabilities"""
    for word in uni_grams:
        model[word] += 1

    total_count = float(sum(model.values()))
    for word in model:
        model[word] /= total_count

    """Task 4: Predict the next word using the unigram model"""
    def predict_next_word(model=model):
        """
        Predicts the next word using the trained unigram model.
        Returns:
        str: The predicted next word.
        """
        return random.choices(list(model.keys()), list(model.values()))[0]
    
    input_word = input("Enter the previous word: ")
    print("The input words are:", input_word)
    
    input_word = input_word.strip().lower().split()
    print("Predicted Next Word using Unigram Model:", predict_next_word())


def BiGramModel(words):
    
    
    raise NotImplementedError("You need to implement Task 1~4.")
    """Task 1: Implement bigram using NLTK"""
    
    """Task 2: Build a bigram model"""
    
    """Task 3: Count frequency of co-occurrence and transform the counts into probabilities"""

    """Task 4: Predict the next word using the bigram model"""

    def predict_next_word(w1, model=model):
        """
        Predicts the next word based on the previous word using the trained bigram model.
        Args:
        w1 (str): The previous word.

        Returns:
        str: The predicted next word.
        """
        raise NotImplementedError("You need to implement this.")
    
    input_word = input("Enter the previous word: ")
    print("The input words are:", input_word)
    
    input_word = input_word.strip().lower().split()
    print("Predicted Next Word using Bigram Model:", predict_next_word(input_word[-1])) 



def TriGramModel(words):

    raise NotImplementedError("You need to implement Task 1~4.")
    """Task 1: Implement trigram using NLTK"""

    """Task 2: Build a trigram model"""

    """Task 3: Count frequency of co-occurrence and transform the counts into probabilities"""

    """Task 4: Predict the next word using the trigram model"""
    def predict_next_word(w1, w2, model=model):
        """
        Predicts the next word based on the previous two words using the trained trigram model.
        Args:
        w1 (str): The first word.
        w2 (str): The second word.

        Returns:
        str: The predicted next word.
        """
        raise NotImplementedError("You need to implement this.")

    
    input_word = input("Enter the previous two words: ")
    print("The input words are:", input_word)
    
    input_word = input_word.strip().lower().split()
    print("Predicted Next Word using Trigram Model:", predict_next_word(input_word[-2], input_word[-1]))



# Test the unigram model
UniGramModel(words)

# Test the bigram model
BiGramModel(words)

# Test the trigram model
TriGramModel(words)


The input words are: ['the', 'news', 'is']
Predicted Next Word using Bigram Model: expected
The input words are: ['the', 'stock', 'of']
Predicted Next Word using Bigram Model: the
