Regular Expressions
Task 1: HTML Cleaning (30 marks)

In [2]:
import re

def clean_html(input_path):
    """Read an HTML file and print its visible text, with the <title> text first."""
    with open(input_path, 'r', encoding='utf-8') as file:
        html_content = file.read()

    # Extract content within the <title> tag
    title_match = re.search(r'<title>(.*?)</title>', html_content, re.DOTALL)
    title_content = title_match.group(1).strip() if title_match else ""

    # Remove HTML comments
    html_content = re.sub(r'<!--.*?-->', '', html_content, flags=re.DOTALL)

    # Remove the <head> and <style> elements with their content (the <title> text was saved above)
    html_content = re.sub(r'<head\b[^>]*>.*?</head>', '', html_content, flags=re.DOTALL)
    html_content = re.sub(r'<style\b[^>]*>.*?</style>', '', html_content, flags=re.DOTALL)

    # Remove all <script> elements and their content
    html_content = re.sub(r'<script\b[^>]*>.*?</script>', '', html_content, flags=re.DOTALL)

    # Remove all HTML tags, keeping only the content
    html_content = re.sub(r'<[^>]+>', '', html_content)

    # Normalise whitespace left behind by the removed markup
    html_content = re.sub(r'[ \t]+', ' ', html_content)  # Collapse runs of spaces and tabs into one space
    html_content = re.sub(r'\n\s*\n', '\n\n', html_content)  # Collapse whitespace-only gaps into one blank line

    # Cap any remaining run of newlines at a single blank line
    html_content = re.sub(r'\n{3,}', '\n\n', html_content)

    # Ensure no line starts with a space
    html_content = '\n'.join(line.lstrip() for line in html_content.splitlines())

    # Add the <title> content at the top and ensure there is a blank line between it and the body
    if title_content:
        html_content = title_content + "\n\n" + html_content.strip()

    # Output the processed content directly to the console
    print(html_content.strip())


# Call the function to process the HTML file and print the result
input_path = '/Users/mengrui/Desktop/input.html'
clean_html(input_path)


AIR6001 AI and Applications / MDS6105 Advanced AI

AIR6001 AI and Applications
MDS6105 Advanced AI

Spring 2023

This course is more relevant than ever given the recent surge in Artificial Intelligence technologies.
It introduces the fundamental concepts, history, and advanced technologies of AI in various applications, 
providing students with a comprehensive understanding of this rapidly evolving field. 
The course covers basic concepts of AI, such as intelligent agents, problem-solving, 
search, and first-order logic. It also delves into fundamental AI technologies, including regression,
pattern recognition, sequential modeling, data mining, deep learning, and supervised modeling. 
Additionally, the course explores technologies used in various AI applications, such as anomaly detection, 
edge AI, image processing, speech processing, natural language processing (including machine translation 
and dialogue modeling), scientific paper analysis, AI in healthcare, autonomous driving, and
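
The cleaning steps above can also be written as a function that takes a string and returns the cleaned text, which makes them easier to check on a small input. This is only a sketch under assumptions: `clean_html_text` and the `sample` document are hypothetical and not part of the assignment, and the tag patterns are made case-insensitive, which the original cell does not do. Regex-based stripping remains fragile on malformed HTML; the standard library's `html.parser` is a sturdier alternative.

```python
import re

def clean_html_text(html):
    """Return the visible text of an HTML string, with the <title> text first."""
    flags = re.DOTALL | re.IGNORECASE
    title_match = re.search(r'<title[^>]*>(.*?)</title>', html, flags)
    title = title_match.group(1).strip() if title_match else ""

    # Drop comments, then whole <head>, <style> and <script> elements.
    html = re.sub(r'<!--.*?-->', '', html, flags=flags)
    for tag in ('head', 'style', 'script'):
        html = re.sub(rf'<{tag}\b[^>]*>.*?</{tag}\s*>', '', html, flags=flags)

    # Strip the remaining tags and normalise whitespace.
    text = re.sub(r'<[^>]+>', '', html)
    text = re.sub(r'[ \t]+', ' ', text)
    text = '\n'.join(line.strip() for line in text.splitlines())
    text = re.sub(r'\n{3,}', '\n\n', text).strip()
    return f"{title}\n\n{text}" if title else text


# Hypothetical sample input, used only to exercise the function.
sample = """<html><HEAD><title>Demo page</title>
<style type="text/css">p { color: red; }</style></HEAD>
<body><!-- hidden note -->
<h1>Heading</h1>


<p>First   paragraph.</p>
<script>var x = 1;</script>
</body></html>"""

print(clean_html_text(sample))
```

Returning the string instead of printing it keeps the cleaning logic separate from file and console I/O, so the same function can be reused when writing output.txt.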

Byte-Pair Encoding
Task 2: Implement Byte-Pair Encoding (BPE) (30 marks)

1. Implement a Byte-Pair Encoding (BPE) algorithm to learn subword tokens. Validate your implementation with the following setup. In every iteration, report the most frequent token pair and its number of occurrences.
• The text is “aaabdaaababc aa”
• The number of merges k = 2
• The expected tokenization output is “{aa}{ab}d{aa}{ab}{ab}c {aa}”, where ‘{’ and ‘}’ mark the new tokens.

In [33]:
import collections

# Subword BPE algorithm
def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the given symbol pair into a single symbol
    bigram = ' '.join(pair)
    new_vocab = {}
    for word in vocab:
        new_word = word.replace(bigram, ''.join(pair))
        new_vocab[new_word] = vocab[word]
    return new_vocab

def bpe(vocab, num_merges):
    # Perform up to num_merges merges, stopping early if no pairs remain
    merges = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        # Find the most frequent symbol pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        print(f"Iteration {i + 1}: Merging pair {best} with frequency {pairs[best]}")
        vocab = merge_vocab(best, vocab)
    return vocab, merges

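# Input text
text = "aaabdaaababc aa"
# Build the vocabulary: each word becomes space-separated characters so pairs are easy to merge.
# Note: text.count(word) counts substring matches in the whole text, so the word "aa" is
# counted 3 times here; the frequencies printed below reflect that.
vocab = {' '.join(word): text.count(word) for word in text.split()}

# Run BPE with k = 2 merges
final_vocab, merges = bpe(vocab, num_merges=2)

# Build the final tokenized text, wrapping each merged token in braces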
output = list(text)
for merge in merges:
    bigram = ''.join(merge)
    new_token = '{' + bigram + '}'
    text_str = ''.join(output)
    text_str = text_str.replace(bigram, new_token)
    output = list(text_str)

final_output = ''.join(output)

print("\nFinal tokenized output:")
print(final_output)


Iteration 1: Merging pair ('a', 'a') with frequency 7
Iteration 2: Merging pair ('a', 'b') with frequency 3

Final tokenized output:
{aa}{ab}d{aa}{ab}{ab}c {aa}
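
The cell above makes two choices that affect the numbers it prints: word frequencies come from `text.count(word)`, which counts substring matches, and merges use `str.replace`, which can join symbols across token boundaries (for example, `'aa b'` contains the string `'a b'`). The sketch below shows one possible variant that counts whole words with `collections.Counter` and merges only complete symbols by using whitespace lookarounds. The function names are hypothetical, and this is an illustration of the idea rather than the required solution.

```python
import collections
import re

def word_vocab(text):
    """Map each space-separated word (as spaced symbols) to its whole-word count."""
    counts = collections.Counter(text.split())
    return {' '.join(word): freq for word, freq in counts.items()}

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge `pair` only where both symbols are complete tokens."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}

def learn_merges(text, num_merges):
    """Return a list of (pair, frequency) for up to num_merges merges."""
    vocab = word_vocab(text)
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append((best, pairs[best]))
        vocab = merge_pair(best, vocab)
    return merges

print(learn_merges("aaabdaaababc aa", 2))
```

On the toy text this selects the same two merges, `('a', 'a')` and then `('a', 'b')`, but reports a frequency of 5 instead of 7 for the first one, because the word "aa" is counted once rather than three times.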


2. Apply BPE to the vocabulary of output.txt. In every iteration, report the most frequent token pair and its number of occurrences. Set the number of merges k = 20.


In [2]:
import collections

# Subword BPE algorithm
def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the given symbol pair into a single symbol
    bigram = ' '.join(pair)
    new_vocab = {}
    for word in vocab:
        new_word = word.replace(bigram, ''.join(pair))
        new_vocab[new_word] = vocab[word]
    return new_vocab

def bpe(vocab, num_merges):
    # Perform up to num_merges merges, stopping early if no pairs remain
    merges = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        # Find the most frequent symbol pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        print(f"Iteration {i + 1}: Merging pair {best} with frequency {pairs[best]}")
        vocab = merge_vocab(best, vocab)
    return vocab, merges

# Path to the input file
file_path = '/Users/mengrui/Desktop/output.txt'

# Read the file content
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read().strip()
# Build the vocabulary: each word becomes space-separated characters so pairs are easy to merge.
# Note: text.count(word) counts substring matches, not whole-word occurrences.
vocab = {' '.join(word): text.count(word) for word in text.split()}

# Run BPE with k = 20 merges
final_vocab, merges = bpe(vocab, num_merges=20)

# Build the final tokenized text, wrapping each merged token in braces
output = list(text)
for merge in merges:
    tokens_to_merge = ['{' + token + '}' if len(token) > 1 else token for token in merge]
    replaced_token = ''.join(tokens_to_merge)
    bigram = ''.join(merge)
    new_token = '{' + bigram + '}'
    text_str = ''.join(output)
    text_str = text_str.replace(replaced_token, new_token)
    output = list(text_str)

final_output = ''.join(output)

print("\nFinal tokenized output:")
print(final_output)


Iteration 1: Merging pair ('i', 'n') with frequency 185
Iteration 2: Merging pair ('o', 'n') with frequency 79
Iteration 3: Merging pair ('r', 'e') with frequency 70
Iteration 4: Merging pair ('e', 'c') with frequency 61
Iteration 5: Merging pair ('a', 't') with frequency 60
Iteration 6: Merging pair ('in', 'g') with frequency 54
Iteration 7: Merging pair ('e', 'n') with frequency 53
Iteration 8: Merging pair ('t', 'u') with frequency 50
Iteration 9: Merging pair ('e', 'a') with frequency 48
Iteration 10: Merging pair ('A', 'I') with frequency 47
Iteration 11: Merging pair ('o', 'r') with frequency 46
Iteration 12: Merging pair ('ec', 'tu') with frequency 44
Iteration 13: Merging pair ('ectu', 're') with frequency 44
Iteration 14: Merging pair ('a', 'l') with frequency 38
Iteration 15: Merging pair ('i', 't') with frequency 38
Iteration 16: Merging pair ('a', 'n') with frequency 37
Iteration 17: Merging pair ('i', 'c') with frequency 37
Iteration 18: Merging pair ('i', 's') with freque
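
The rendering loop in the cell above rewrites the whole text with repeated `str.replace` calls, which depends on the brace markers already inserted by earlier merges. A possible alternative, sketched here with hypothetical helper names `apply_merges` and `render`, is to replay the learned merges on a list of symbols for each word and add the braces only at the end. It assumes words are separated by single spaces.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying the merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def render(text, merges):
    """Join the tokens of each word, wrapping multi-character tokens in braces."""
    words = []
    for word in text.split(' '):
        tokens = apply_merges(word, merges)
        words.append(''.join('{' + t + '}' if len(t) > 1 else t for t in tokens))
    return ' '.join(words)

print(render("aaabdaaababc aa", [('a', 'a'), ('a', 'b')]))
# {aa}{ab}d{aa}{ab}{ab}c {aa}
```

With the two merges from Task 2.1 this reproduces the expected tokenization, and because it never searches for brace characters it behaves the same whether or not the input text itself contains `{` or `}`.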

3. Among the new tokens in Task 2.2, what are the three longest tokens? Do not list components of a listed token. For example, if you have two tokens “speech” and “eech”, list only “speech”. Discuss the reasons why those three tokens were selected.

Answer: The longest token is "Lecture", the second longest is "ing", and the third is "in", which has the highest frequency. "in" is probably selected because it is the most frequent character pair in this text (frequency 185 in iteration 1). "ing" is a common English suffix, as in "processing" and "modeling". "Lecture" appears repeatedly in the Schedule part of the page. That is why these three tokens are merged so frequently.
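
The "do not list components" rule from the question can be expressed as a small filter over the merge list. The sketch below uses a hypothetical helper, `longest_tokens`, and only the 18 merges that are visible in the truncated output of Task 2.2; iterations 19 and 20 are cut off, so it cannot confirm the answer above and only illustrates the filtering rule.

```python
def longest_tokens(merges, n=3):
    """Return the n longest merged tokens that are not contained in another token."""
    tokens = [''.join(pair) for pair in merges]
    # Keep a token only if it is not a substring of some other token.
    kept = [t for t in tokens
            if not any(t != other and t in other for other in tokens)]
    # sorted() is stable, so equally long tokens stay in merge (frequency) order.
    return sorted(kept, key=len, reverse=True)[:n]

# The 18 merges visible in the printed output of Task 2.2.
visible_merges = [
    ('i', 'n'), ('o', 'n'), ('r', 'e'), ('e', 'c'), ('a', 't'), ('in', 'g'),
    ('e', 'n'), ('t', 'u'), ('e', 'a'), ('A', 'I'), ('o', 'r'), ('ec', 'tu'),
    ('ectu', 're'), ('a', 'l'), ('i', 't'), ('a', 'n'), ('i', 'c'), ('i', 's'),
]
print(longest_tokens(visible_merges))
```

Under this rule "in" is dropped because it is a component of "ing", and the many two-character tokens tie for third place, so the earliest merge among them is listed.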

4. Set the number of merges k = 100 and discuss the results.

In [35]:
import collections

# Subword BPE algorithm
def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge every occurrence of the given symbol pair into a single symbol
    bigram = ' '.join(pair)
    new_vocab = {}
    for word in vocab:
        new_word = word.replace(bigram, ''.join(pair))
        new_vocab[new_word] = vocab[word]
    return new_vocab

def bpe(vocab, num_merges):
    # Perform up to num_merges merges, stopping early if no pairs remain
    merges = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        # Find the most frequent symbol pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        print(f"Iteration {i + 1}: Merging pair {best} with frequency {pairs[best]}")
        vocab = merge_vocab(best, vocab)
    return vocab, merges

# Path to the input file
file_path = '/Users/mengrui/Desktop/output.txt'

# Read the file content
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read().strip()
# Build the vocabulary: each word becomes space-separated characters so pairs are easy to merge.
# Note: text.count(word) counts substring matches, not whole-word occurrences.
vocab = {' '.join(word): text.count(word) for word in text.split()}

# Run BPE with k = 100 merges
final_vocab, merges = bpe(vocab, num_merges=100)

# Build the final tokenized text, wrapping each merged token in braces
output = list(text)
for merge in merges:
    tokens_to_merge = ['{' + token + '}' if len(token) > 1 else token for token in merge]
    replaced_token = ''.join(tokens_to_merge)
    bigram = ''.join(merge)
    new_token = '{' + bigram + '}'
    text_str = ''.join(output)
    text_str = text_str.replace(replaced_token, new_token)
    output = list(text_str)

final_output = ''.join(output)

print("\nFinal tokenized output:")
print(final_output)


Iteration 1: Merging pair ('i', 'n') with frequency 185
Iteration 2: Merging pair ('o', 'n') with frequency 79
Iteration 3: Merging pair ('r', 'e') with frequency 70
Iteration 4: Merging pair ('e', 'c') with frequency 61
Iteration 5: Merging pair ('a', 't') with frequency 60
Iteration 6: Merging pair ('in', 'g') with frequency 54
Iteration 7: Merging pair ('e', 'n') with frequency 53
Iteration 8: Merging pair ('t', 'u') with frequency 50
Iteration 9: Merging pair ('e', 'a') with frequency 48
Iteration 10: Merging pair ('A', 'I') with frequency 47
Iteration 11: Merging pair ('o', 'r') with frequency 46
Iteration 12: Merging pair ('ec', 'tu') with frequency 44
Iteration 13: Merging pair ('ectu', 're') with frequency 44
Iteration 14: Merging pair ('a', 'l') with frequency 38
Iteration 15: Merging pair ('i', 't') with frequency 38
Iteration 16: Merging pair ('a', 'n') with frequency 37
Iteration 17: Merging pair ('i', 'c') with frequency 37
Iteration 18: Merging pair ('i', 's') with freque

After 100 iterations, many tokens in the text are whole words or close to them, for example AI, in, and Lecture. As the number of iterations increases, frequently co-occurring character pairs keep being merged, so the tokens grow longer and increasingly correspond to meaningful parts of words.
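
One way to make the effect of a larger k concrete is to count how many tokens a text is split into after k merges. The sketch below is an assumption-laden illustration: `bpe_token_counts` is a hypothetical helper, it works on lists of symbols rather than the space-joined strings used in the cells above, and it runs on a single sentence from the course description instead of output.txt.

```python
import collections

def bpe_token_counts(text, ks):
    """Return {k: total number of tokens after k merges} for each k in ks."""
    words = [list(w) for w in text.split()]
    results = {}
    for step in range(max(ks) + 1):
        if step in ks:
            # At this point exactly `step` merges have been applied.
            results[step] = sum(len(w) for w in words)
        pairs = collections.Counter(
            (a, b) for w in words for a, b in zip(w, w[1:])
        )
        if not pairs:
            continue  # nothing left to merge; the count stays the same
        (left, right), _ = pairs.most_common(1)[0]
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == left and w[i + 1] == right:
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return results

sample_text = ("This course is more relevant than ever given the recent "
               "surge in Artificial Intelligence technologies.")
counts = bpe_token_counts(sample_text, {0, 5, 10, 20})
for k in sorted(counts):
    print(k, counts[k])
```

Every merge replaces at least one pair of tokens with a single token, so the total can only go down as k grows, which matches the observation that tokens become longer and more word-like.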