# Assignment

I often need "fake data" to show people how to do data manipulation tasks with regular expressions or pandas. The
problem is that sometimes the data I generate on the web is too messy, and I get bogged down showing students how to
clean all of the data when some of it isn't representative of what I want. In this assignment you get a chance to help
me generate realistic fake data, all with Llama 2!

For each question I will describe the data in natural language, and you must write a function which queries Llama 2 to
generate data in that format that adheres to the description I've written.


In [1]:
import os
import re
from llama_cpp import Llama
from llama_cpp.llama_types import *
from llama_cpp.llama_grammar import *

## Question 1

Generate for me a list of ten fictitious names, where the first name is a single word, and the last (family) name may be
(but doesn't have to be!) up to two words separated by a hyphen. Don't include titles, honorifics, or middle names. The
autograder will expect that you return a list[str] where each value in the list is a full name.
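
The name format described above can be checked with a small validator before the list is returned. This is only a sketch: `NAME_PATTERN`, `is_valid_name`, and `validate_names` are hypothetical helpers rather than part of the autograder, and the pattern assumes names use ASCII letters only.

```python
import re

# Hypothetical pattern: a one-word first name, a space, then a last name made of
# one word or two words joined by a hyphen. ASCII letters only in this sketch.
NAME_PATTERN = re.compile(r"[A-Za-z]+ [A-Za-z]+(?:-[A-Za-z]+)?")


def is_valid_name(candidate: str) -> bool:
    """Return True when the whole string fits the first-name/last-name shape."""
    return NAME_PATTERN.fullmatch(candidate.strip()) is not None


def validate_names(names: list) -> bool:
    """Check for exactly ten entries that all pass is_valid_name."""
    return len(names) == 10 and all(is_valid_name(name) for name in names)
```

Because `fullmatch` must consume the entire string, titles such as "Dr." and middle names are rejected rather than silently trimmed.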


In [25]:
def generate_names() -> list[str]:
    model: Llama = Llama(model_path=os.environ["LLAMA_13B"], verbose=False, n_ctx=2048)
    results: list[str] = []
    # Prompt that nudges the model to complete with a first and last name
    prompt = "Only imagine a name and last name like:"
    # Match a first name followed by a last name of one or two hyphen-separated words
    pattern = r"([A-Za-z]+\s[A-Za-z]+(?:-[A-Za-z]+)?)"

    while len(results) < 10:
        string_result = ""
        # Stream the completion and accumulate the generated text
        for response in model.create_completion(prompt, max_tokens=16, stream=True):
            string_result += response["choices"][0]["text"]

        # Keep the first name-like match; retry if the completion had none
        match = re.search(pattern, string_result.strip())
        if match:
            results.append(match.group(1))

    return results

In [26]:
# Invoke student code
from contextlib import redirect_stderr
import tempfile

with redirect_stderr( tempfile.TemporaryFile('wt') ) as error_catcher:
    results = generate_names()

# Verify the length
assert (
    len(results) == 10
), f"You did not return ten and only ten results, instead we got {len(results)}."


## Question 2

Generate for me a list of 5 things to do in your hometown (or mine if you prefer, Ann Arbor Michigan!). The key is that
these should all (a) start with a number and (b) be no more than three sentences long. So the following would be a good
item:

- 1\. Go to the Henry Ford Museum. The Henry Ford Museum has all sorts of wonderful exhibits for all ages. One
  particular highlight includes giant trains!

While the following would **not be good items** (the first item does not start with a number, the second item is not a
sentence as it doesn't end in punctuation, and the third item just goes on and on and on):

- A\. Go to the University of Michigan. The University of Michigan is a school with more than 50,000 students in Ann
  Arbor, MI. The University of Michigan is a public School.
- 2\. Visit the Detroit Eastern Market
- 3\. Visit Sleeping Bear Dunes. The dunes are located along the northwest coast of the Lower Peninsula of Michigan in
  Leelanau and Benzie counties near Traverse City. It covers a 35-mile-long stretch of Lake Michigan's eastern
  coastline, as well as North and South Manitou islands. This national park is known for its massive dunes, some of
  which are over 400 feet high. The area gets its name from the Native American legend of the Sleeping Bear. According
  to the story, a mother bear and her two cubs were trying to cross Lake Michigan from Wisconsin to escape a forest
  fire.
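
The two rules above (starts with a number, no more than three sentences) can be approximated with a short check. This is a rough sketch under stated assumptions: `NUMBER_PREFIX`, `SENTENCE_END`, and `is_valid_item` are hypothetical names, the number is assumed to be followed by a period, and sentences are counted naively by terminal punctuation, so abbreviations such as "St." or "MI." will be over-counted.

```python
import re

# Assumption: an item starts with digits and a period, such as "1. ".
NUMBER_PREFIX = re.compile(r"^\d+\.\s+")
# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def is_valid_item(item: str, max_sentences: int = 3) -> bool:
    """Return True for a numbered item of one to max_sentences sentences."""
    text = item.strip()
    prefix = NUMBER_PREFIX.match(text)
    if prefix is None:
        return False  # does not start with a number
    body = text[prefix.end():]
    if not body or body[-1] not in ".!?":
        return False  # last sentence is not terminated
    return 1 <= len(SENTENCE_END.findall(body)) <= max_sentences
```

A generation loop could call `is_valid_item` on each candidate and retry until five items pass.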


In [29]:
def generate_trip_recommendations() -> list[str]:
    model: Llama = Llama(model_path=os.environ["LLAMA_13B"], verbose=False, n_ctx=2048)
    results: list[str] = []
    # Prompt that asks the model to continue with places to visit
    prompt = "I'm a tourist in London, so I must visit:"

    while len(results) < 5:
        string_result = ""
        # Stream the completion and accumulate the generated text
        for response in model.create_completion(prompt, max_tokens=64, stream=True):
            string_result += response["choices"][0]["text"]

        # NOTE: completions are stored unfiltered, so items are not guaranteed to
        # start with a number or to stay within three sentences
        results.append(string_result)

    return results


["\nBig Ben: it's the most known thing of London.\nTower Bridge: is one of London's most famous landmarks\nHyde Park and Regent's Park: two parks that are popular with tourists.\nLondon Eye: a large Ferris wheel on the south",
 '\nThe Tower of London is one of the most famous landmarks in England and arguably the world. The tower itself stands on the north bank of the River Thames in central London. The Tower of London has seen service as royal palace, prison, armoury, treasury, menagerie and',
 '\nWestminster Palace: it was founded as a priory of Benedictine monks during the reign of Edward the Confessor. In 1045 St Edward was buried there.\nTower Bridge: Tower Bridge is one of five London bridges now owned and maintained by the Bridge House Estates',
 "\n1- The Houses of Parliment\n2 - Trafalgar Square\n3 - The Tower of London\n4- Big Ben (I don't know the name of the clock)\n5- St.Paul Cathedral\n6- Hyde Park\n7- Buckingham Palace\n8",
 "\n*Tower of London *Buckingham Palace *Big Be

In [30]:
# Invoke student code
from contextlib import redirect_stderr
import re
import tempfile

with redirect_stderr( tempfile.TemporaryFile('wt') ) as error_catcher:
    results=generate_trip_recommendations()

# Verify length
assert (
    len(results) == 5
), f"You did not return five and only five results, instead we got {len(results)}."


## Question 3

Generate for me US-based addresses which have a person's name which usually appears on the first line, an optional
company name which often goes on the second line, a street address which has a number followed by some text description,
a city and state where the state is a two letter identifier and comes after the city name, and zip code which is a five
digit number (but as a string, since it could start with 0) followed by an optional hyphen and four more digits.

To make it easy for you to conform to this set of requirements, I created a simple class from the following example --
my mailing address!

> Dr. Christopher Brooks
>
> > School of Information, University of Michigan
> >
> > 105 S. State St.
> >
> > Ann Arbor, MI
> >
> > 48109-1285

Your function should return exactly 5 of these entries!
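
To make the format concrete, here is a minimal sketch that splits a plain-text address like the example above into its fields. The helper `parse_address` and its three patterns are hypothetical and not part of the assignment; the dictionary keys mirror the fields of the `MailingAddress` class, and the sketch assumes one field group per line with the company line being the only optional one.

```python
import re

# Hypothetical patterns for this sketch; they cover the stated format only.
STREET = re.compile(r"(\d+)\s+(.+)")            # street number, then free text
CITY_STATE = re.compile(r"(.+),\s*([A-Z]{2})")  # city, then two-letter state
ZIP_CODE = re.compile(r"(\d{5})(?:-(\d{4}))?")  # five digits, optional -NNNN


def parse_address(text: str) -> dict:
    """Split a four- or five-line address into named fields."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) not in (4, 5):
        raise ValueError("expected four or five non-empty lines")
    street = STREET.fullmatch(lines[-3])
    city_state = CITY_STATE.fullmatch(lines[-2])
    zip_code = ZIP_CODE.fullmatch(lines[-1])
    if not (street and city_state and zip_code):
        raise ValueError("address does not match the expected layout")
    return {
        "name": lines[0],
        # The company line is present only in the five-line form
        "business_name": lines[1] if len(lines) == 5 else None,
        "street_number": int(street.group(1)),
        "street_text": street.group(2),
        "city": city_state.group(1),
        "state": city_state.group(2),
        # Kept as a string so a leading zero survives
        "zip_code_short": zip_code.group(1),
        "zip_code_long": zip_code.group(0) if zip_code.group(2) else None,
    }
```

Keeping the ZIP code as a string matters for codes such as `02134`, which would lose the leading zero as an integer.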

And, if you've gotten this far in the course, why not send me a postcard and introduce yourself? Everyone loves getting
mail!

(Don't forget to add **United States** if sending mail internationally, even though that field is missing from the
assignment's `MailingAddress` class.)


In [12]:
from dataclasses import dataclass


@dataclass
class MailingAddress:
    name: str  # Full name, e.g. Dr. Christopher Brooks
    business_name: str | None  # Optional business name, e.g. School of Information, University of Michigan
    street_number: int  # Numeric address value, e.g. 105
    street_text: str  # Street information other than numeric address, e.g. S. State St.
    city: str  # City name, e.g. Ann Arbor
    state: str  # State name, only two letters, e.g. MI for Michigan
    zip_code_short: str  # The first five digits of the zip code, e.g. 48109, as a string value, since it could start with 0
    zip_code_long: str | None  # The extended zip code (optional) which is the full zip code, e.g. 48109-1285


def generate_addresses() -> list[MailingAddress]:
    model: Llama = Llama(model_path=os.environ["LLAMA_13B"], verbose=False, n_ctx=2048)
    results: list[MailingAddress] = []

    # Our prompt is a pair of example addresses written as "field: value" lines
    prompt = '''
    name: Dr. Christopher Brooks
    business_name: School of Information, University of Michigan
    street_number: 105
    street_text: State St.
    city: Ann Arbor
    state: MI
    zip_code_short: 48109
    zip_code_long: 48109-1285

    name: Dr. Fred Brooks
    business_name: University of London
    street_number: 12
    street_text: Liverpool St.
    city: London
    state: UK
    zip_code_short: 56456
    zip_code_long: 56456-1285
    '''

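    # GBNF grammar that forces every completion into the same "field: value" layout.
    # Note: zipcodelong accepts digits only, so a hyphenated ZIP+4 such as 48109-1285
    # cannot be produced by this grammar.
    grammar = r'''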
        root ::= address+
        address ::= name businessname streetnumber streettext city state zipcodeshort zipcodelong "\n"
        name ::= "name: " [A-Za-z ]* "\n"
        businessname ::= "business_name: " [A-Za-z ,]* "\n"
        streetnumber ::= "street_number: " [0-9]* "\n"
        streettext ::= "street_text: " [A-Za-z .]* "\n"
        city ::= "city: " [A-Za-z ]* "\n"
        state ::= "state: " [A-Za-z ]* "\n"
        zipcodeshort ::= "zip_code_short: " [0-9]* "\n"
        zipcodelong ::= "zip_code_long: " [0-9]* "\n"
    '''
    # Compile the grammar once rather than on every loop iteration
    compiled_grammar = LlamaGrammar.from_string(grammar=grammar)
    while len(results) < 5:
        result = model.create_completion(prompt, grammar=compiled_grammar, stream=True, max_tokens=128)
        string_result = ""
        for item in result:
            string_result += item["choices"][0]["text"]

        # The first eight lines hold one address; keep the text after each "field: " label
        values = [line.split(":", 1)[1].strip() for line in string_result.split("\n")[:8]]
        address = MailingAddress(
            name=values[0],
            business_name=values[1],
            street_number=int(values[2]),
            street_text=values[3],
            city=values[4],
            state=values[5],
            zip_code_short=values[6],
            zip_code_long=values[7],
        )
        results.append(address)

    return results

generate_addresses()

from_string grammar:
root ::= root_2 
address ::= name businessname streetnumber streettext city state zipcodeshort zipcodelong [<U+000A>] 
root_2 ::= address root_2 | address 
name ::= [n] [a] [m] [e] [:] [ ] name_11 [<U+000A>] 
businessname ::= [b] [u] [s] [i] [n] [e] [s] [s] [_] [n] [a] [m] [e] [:] [ ] businessname_12 [<U+000A>] 
streetnumber ::= [s] [t] [r] [e] [e] [t] [_] [n] [u] [m] [b] [e] [r] [:] [ ] streetnumber_13 [<U+000A>] 
streettext ::= [s] [t] [r] [e] [e] [t] [_] [t] [e] [x] [t] [:] [ ] streettext_14 [<U+000A>] 
city ::= [c] [i] [t] [y] [:] [ ] city_15 [<U+000A>] 
state ::= [s] [t] [a] [t] [e] [:] [ ] state_16 [<U+000A>] 
zipcodeshort ::= [z] [i] [p] [_] [c] [o] [d] [e] [_] [s] [h] [o] [r] [t] [:] [ ] zipcodeshort_17 [<U+000A>] 
zipcodelong ::= [z] [i] [p] [_] [c] [o] [d] [e] [_] [l] [o] [n] [g] [:] [ ] zipcodelong_18 [<U+000A>] 
name_11 ::= [A-Za-z ] name_11 | 
businessname_12 ::= [A-Za-z ,] businessname_12 | 
streetnumber_13 ::= [0-9] streetnumber_13 | 
streettext_14 :

name: Dr John Brooks
business_name: School of Information, University of Michigan
street_number: 203
street_text: State St.
city: Ann Arbor
state: MI
zip_code_short: 48109
zip_code_long: 48109

name: Dr Robert Brooks
business_name: School of Information, University of Michigan
street_number: 320
street_text: State St.
city: Ann Arbor
state: MI
zip_code_short

[MailingAddress(name='Dr John Brooks', business_name='School of Information, University of Michigan', street_number=203, street_text='State St.', city='Ann Arbor', state='MI', zip_code_short='48109', zip_code_long='48109')]

In [None]:
# Invoke student code
from contextlib import redirect_stderr
import tempfile

with redirect_stderr( tempfile.TemporaryFile('wt') ) as error_catcher:
    results=generate_addresses()

# Verify length
assert (
    len(results) == 5
), f"You did not return five and only five results, instead we got {len(results)}."
