# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

In [16]:
library(tidyverse)
library(tokenizers)
library(httr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


#### a) Make a function to tokenize the text.

In [38]:
tokenize_text <- function (x) {
    tokenize_words(x, lowercase=TRUE, strip_punct=TRUE)[[1]]
}

#### b) Make a function to generate keys for ngrams.

In [8]:
gen_key <- function(ngram, sep="\x1f") {
    paste(ngram, collapse=sep)
}
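
As a quick check of why the unit separator (`"\x1f"`) is a convenient choice, the sketch below builds a key and splits it back into words. The helper is repeated only so the example runs on its own, and the words are made-up sample input.

```r
# Same helper as above, repeated so this example is self-contained.
gen_key <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse = sep)
}

# Join two words into a single key string.
key <- gen_key(c("the", "king"))

# Splitting on the separator recovers the original words, because the
# unit separator character does not occur inside ordinary word tokens.
words <- strsplit(key, "\x1f", fixed = TRUE)[[1]]
print(words)
```

This round trip is what `random_start` relies on when it turns a stored key back into start words.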

#### c) Make a function to build an ngram table.

In [18]:
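build_ngram_table <- function(tokens, n, sep="\x1f") {
    # Too few tokens to form a single ngram: return an empty table.
    if (length(tokens) < n) return (new.env(parent=emptyenv()))
    tbl <- new.env(parent=emptyenv())
    for (i in seq_len(length(tokens) - n + 1L)) {
        # The first n-1 words form the key; the nth word is the continuation.
        ngram <- tokens[i:(i + n - 2L)]
        next_word <- tokens[i+n-1L]
        # Pass sep through so a non-default separator is respected.
        key <- gen_key(ngram, sep=sep)
        # Named integer vector of continuation counts for this key.
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}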
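
One way to sanity-check the table is to recompute the same counts for a tiny input with vectorized base R. The sketch below is not the function above but an independent cross-check; the token vector is invented for illustration.

```r
# Toy input (hypothetical): "the king" is followed once by "said", once by "ran".
tokens <- c("the", "king", "said", "the", "king", "ran")
n <- 3L

# Start positions of every window of n consecutive tokens.
idx <- seq_len(length(tokens) - n + 1L)

# Key = first n-1 words of each window, joined with the unit separator.
keys <- vapply(
    idx,
    function(i) paste(tokens[i:(i + n - 2L)], collapse = "\x1f"),
    character(1)
)

# Continuation = the nth word of each window.
nexts <- tokens[idx + n - 1L]

# Group continuations by key and count them.
counts_by_key <- lapply(split(nexts, keys), table)

the_king <- counts_by_key[[paste(c("the", "king"), collapse = "\x1f")]]
print(the_king)
```

For this input there are three distinct keys, and the key for "the king" holds one count each for "ran" and "said", which is the shape `build_ngram_table` stores per key.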

#### d) Function to digest the text.

In [24]:
digest <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [20]:
url_digest <- function (url, n) {
    res <- GET(url)
    # Fail early on an HTTP error instead of digesting an error page.
    stop_for_status(res)
    text <- content(res, as='text', encoding='UTF-8')
    digest(text, n)
}

#### f) Function that gives random start.

In [12]:
random_start <- function(tbl, sep='\x1f') {
    keys <- ls(envir = tbl, all.names = TRUE)
    if (length(keys) == 0) stop("No ngrams available. Need to digest some text.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed= TRUE)[[1]]
}

#### g) Function to predict the next word.

In [39]:
pred_next_word <- function(tbl, ngram, sep='\x1f') {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}
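
The `prob` argument is what makes frequent continuations more likely: `sample` treats the counts as weights and normalizes them itself. A minimal sketch with made-up counts (3 to 1) shows the effect over many draws; the seed and the number of draws are arbitrary choices.

```r
# Hypothetical continuation counts for one key.
counts <- c(said = 3L, ran = 1L)

set.seed(1)
# Draw one continuation at a time, 1000 times, weighting by count.
draws <- replicate(
    1000,
    sample(names(counts), size = 1, prob = as.numeric(counts))
)

# Share of draws that picked "said"; it should be close to 3 / (3 + 1).
prop_said <- mean(draws == "said")
print(prop_said)
```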

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [46]:
ngram_gen <- function(url, n, sep = '\x1f') {
    tbl <- url_digest(url, n) 
    function(start_words = NULL, length = 10L) {
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            print("Random Start will be used!")
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        while(length(word_sequence) < length) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- pred_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimms' Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [47]:
set.seed(2025)
gen <- ngram_gen('https://www.gutenberg.org/cache/epub/2591/pg2591.txt', 3)
gen(c("the", "king"), 15)
gen(length = 15)

[1] "Random Start will be used!"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [48]:
gen <- ngram_gen('https://www.gutenberg.org/cache/epub/46342/pg46342.txt', 3)
gen(c("the", "king"), 15)
gen(length = 15)

[1] "Random Start will be used!"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

The text generated from Grimms' Fairy Tales has a folk-tale, storybook feel, mentioning things such as a king, a song, a lake, and a child. The text generated from Ancient Armour and Weapons in Europe reads as technical, historical prose about real medieval subjects, such as a king commanding an army into a strategic position and a pyx (a silver container) bearing a portrait of a European individual.

## Question 3
#### a) What is a language learning model? 
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

a)            
It is a model that predicts the next word (or token) in a sequence from the words that precede it, using probabilities learned from the text it was trained on.

b)                   
Using Ollama as an example (it can run natively or inside a Docker container on the local machine), you download the model ahead of time with `ollama pull` while you still have a connection. Once the model is stored locally, no internet access is needed: you start it with `ollama run`, or start the API server with `ollama serve`, and communicate with the model through HTTP requests to localhost (port 11434 by default).
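
A minimal sketch of such an HTTP request from R with `httr` is shown below. It assumes an Ollama server is already running locally on its default port and that a model has been pulled; the function name `ask_local_model` and the model name `llama3.2` are illustrative placeholders, not part of the assignment.

```r
# Hypothetical helper: send one prompt to a locally running Ollama server.
# Assumes the server listens on the default port 11434 and that the named
# model (an example name) was pulled while online.
ask_local_model <- function(prompt, model = "llama3.2",
                            base_url = "http://localhost:11434") {
    res <- httr::POST(
        paste0(base_url, "/api/generate"),
        body = list(model = model, prompt = prompt, stream = FALSE),
        encode = "json"
    )
    # Stop with an informative error if the server returned an HTTP error.
    httr::stop_for_status(res)
    # With stream = FALSE the reply is a single JSON object; the generated
    # text is in its `response` field.
    httr::content(res, as = "parsed", type = "application/json")$response
}

# Only works while the local server is running:
# ask_local_model("Explain what a trigram is in one sentence.")
```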

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
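| **Shell** | The program (for example bash or zsh) that reads the line `mkdir project`, parses it, finds the `mkdir` program, and runs it. It is the text interface to the operating system's functionality. |
| **Terminal emulator** | The window application that hosts the shell: it displays the prompt and output and forwards our keystrokes, so it is where we type `mkdir project`. |
| **Process** | A running instance of a program. The shell is one process, and running `mkdir project` starts a new, short-lived `mkdir` process that exits once the directory is created. |
| **Signal** | A notification sent to a process telling it to do something, such as stop. `mkdir` finishes too quickly for this to matter here; an example is pressing Ctrl-C, which sends SIGINT to the running process. |
| **Standard input** | The default stream a process reads its input from (the keyboard, unless redirected or piped). `mkdir` does not read from it. |
| **Standard output** | The default stream a process writes its output to (the terminal, unless redirected or piped). `mkdir project` prints nothing on success; error messages go to standard error instead. |
| **Command line argument** | The words that follow the program name and are passed to the program when it starts. Here `project` is the argument given to `mkdir`, naming the directory to create. |
| **The environment** | The set of variables (such as `PATH` and `HOME`) that a process inherits from its parent. The shell uses `PATH` to locate the `mkdir` program, and the `mkdir` process receives a copy of the environment. |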
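
Several of these terms can be seen from inside R, which can start the same `mkdir` process that the shell would. The sketch below assumes a Unix-like system where `mkdir` is an executable on the `PATH`; the directory name `demo_project` is made up, and the work happens in a temporary directory.

```r
# Work in a fresh throwaway directory so nothing is left in the project.
work <- tempfile("shell_demo_")
dir.create(work)
old <- setwd(work)

# Start a new process: "mkdir" is the program, "demo_project" is its
# command line argument. Capture standard output and standard error.
out <- system2("mkdir", args = "demo_project", stdout = TRUE, stderr = TRUE)

# mkdir is silent on success, so nothing was written to either stream.
n_lines <- length(out)
created <- dir.exists("demo_project")

# The environment: PATH lists the directories searched for programs,
# and Sys.which() reports where mkdir was found.
path_value <- Sys.getenv("PATH")
mkdir_location <- Sys.which("mkdir")

setwd(old)
print(created)
```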

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

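a) `find`, `xargs`, and `grep` are the programs we run.  
b) `find . -iname "*.R"` searches the current directory (`.`) and all of its subdirectories for files whose names match `*.R`; `-iname` makes the match case-insensitive, so files ending in `.r` are included too. The pipe (`|`) sends that list of file names to the standard input of `xargs`, which turns them into arguments and runs `grep read_csv file1 file2 ...`. `grep` then prints every line containing `read_csv` from those files, so the whole command shows each use of `read_csv` in the R scripts under the current directory.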
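
For comparison, roughly the same search can be sketched in base R. This is an analogue for illustration, not what the shell command runs internally; the files are created in a temporary directory and their names and contents are invented.

```r
# Build a small throwaway directory tree to search.
work <- tempfile("find_demo_")
dir.create(file.path(work, "sub"), recursive = TRUE)
writeLines("df <- read_csv('a.csv')", file.path(work, "load.R"))
writeLines("x <- 1", file.path(work, "sub", "other.r"))
writeLines("read_csv is mentioned here", file.path(work, "notes.txt"))

# Like find . -iname "*.R": recursive, case-insensitive match on the name.
r_files <- list.files(work, pattern = "\\.r$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Like grep read_csv: check each file for a line containing the text.
has_match <- vapply(
    r_files,
    function(f) any(grepl("read_csv", readLines(f), fixed = TRUE)),
    logical(1)
)

# Names of the R files that contain read_csv.
matches <- basename(r_files[has_match])
print(matches)
```

The text file is skipped even though it mentions `read_csv`, because the name filter runs first, just as `find` limits what `grep` ever sees.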

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

a)                      
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
17eec7bbc9d7: Pull complete
Digest: sha256:f7931603f70e13dbd844253370742c4fc4202d290c80442b2e68706d8f33ce26
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash


b)                          
root@ValSweat:/mnt/c/Users/ddudh/PROJECTS# cd ./BIOS512-PROJECT
root@ValSweat:/mnt/c/Users/ddudh/PROJECTS/BIOS512-PROJECT# docker build . -t first_container
root@ValSweat:/mnt/c/Users/ddudh/PROJECTS/BIOS512-PROJECT# docker run -it \
 -e USERID=1001 \
 -e GROUPID=1001 \
 -e PASSWORD="cinema123#" \
 -p 8787:8787 \
 -v $(pwd):/home/rstudio/project1 \
 first_container
[+] Building 0.6s (10/10) FINISHED                                                                                    docker:default
 => [internal] load build definition from Dockerfile                                                                            0.0s
 => => transferring dockerfile: 810B                                                                                            0.0s
 => [internal] load metadata for docker.io/rocker/verse:4.4.2                                                                   0.5s
 => [auth] rocker/verse:pull token for registry-1.docker.io                                                                     0.0s
 => [internal] load .dockerignore                                                                                               0.0s
 => => transferring context: 2B                                                                                                 0.0s
 => [1/5] FROM docker.io/rocker/verse:4.4.2@sha256:5f4b1f351b2ffca3d7561e74ffa22c16e9ce585340bd402d16544cec60549262             0.0s
 => CACHED [2/5] RUN apt-get update && apt-get install -y git                                                                   0.0s
 => CACHED [3/5] RUN apt-get update  && apt-get install -y --no-install-recommends       ca-certificates       curl       gnup  0.0s
 => CACHED [4/5] RUN npm install -g @qwen-code/qwen-code@latest                                                                 0.0s
 => CACHED [5/5] WORKDIR /home/rstudio                                                                                          0.0s
 => exporting to image                                                                                                          0.0s
 => => exporting layers                                                                                                         0.0s
 => => writing image sha256:8796fea9610972779c8556921327814e51cb73be66a62d9c422ddf10495f0fe3                                    0.0s
 => => naming to docker.io/library/first_container                                                                              0.0s
[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 01_set_env: executing...
skipping /var/run/s6/container_environment/HOME
skipping /var/run/s6/container_environment/PASSWORD
skipping /var/run/s6/container_environment/RSTUDIO_VERSION
[cont-init.d] 01_set_env: exited 0.
[cont-init.d] 02_userconf: executing...
deleting the default user
creating new rstudio with UID 1001
useradd: warning: the home directory /home/rstudio already exists.
useradd: Not copying any file from skel directory into it.
Modifying primary group rstudio
Primary group ID is now custom_group 1001
[cont-init.d] 02_userconf: exited 0.
[cont-init.d] done.
[services.d] starting services
[services.d] done.
TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to 'syslog'.


c)                   
Run the Docker container, then open a browser and go to http://localhost:8787 (the `-p 8787:8787` flag maps port 8787 on our machine to port 8787 in the container). The RStudio login screen appears. Log in with the default username `rstudio` (this can also be changed) and the password you set with `-e PASSWORD=...`; if you did not set one, use the password printed in the terminal when the container starts.