# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [4]:
tokenize <- function(text) {
  text <- tolower(text)
  
  text <- gsub("[[:punct:]]", " ", text)
  
  tokens <- unlist(strsplit(text, "\\s+"))
  
  tokens <- tokens[tokens != ""]
  
  return(tokens)
}

#### b) Make a function to generate keys for ngrams.

In [7]:
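generate_ngram_keys <- function(tokens, n) {
  # A key is the first n - 1 words of an n-gram, so n must be at least 2;
  # with n = 1 the index i:(i + n - 2) would run backwards.
  if (n < 2) {
    stop("n must be at least 2.")
  }
  if (length(tokens) < n) {
    return(character(0))
  }

  # One key per complete n-gram; vapply avoids growing a vector in a loop.
  vapply(
    seq_len(length(tokens) - n + 1),
    function(i) paste(tokens[i:(i + n - 2)], collapse = " "),
    character(1)
  )
}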
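
To make the alignment between keys and next words concrete, here is a small worked example on a made-up, already tokenized sentence. The sentence and the variable names are assumptions chosen for illustration; the indexing mirrors the function above for n = 3.

```r
# Toy sentence, already tokenized (hypothetical example data).
tokens <- c("the", "king", "saw", "the", "king", "smile")
n <- 3

# Number of complete n-grams in the token vector.
n_grams <- length(tokens) - n + 1

# Each key is the first n - 1 words of an n-gram.
keys <- vapply(
  seq_len(n_grams),
  function(i) paste(tokens[i:(i + n - 2)], collapse = " "),
  character(1)
)

# The word that completes each n-gram lines up with its key by position.
next_words <- tokens[n:length(tokens)]

print(data.frame(key = keys, next_word = next_words))
```

The key "the king" appears twice here, once followed by "saw" and once by "smile", which is exactly the kind of repetition the ngram table counts.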

#### c) Make a function to build an ngram table.

In [10]:
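build_ngram_table <- function(tokens, n) {
  if (length(tokens) < n) {
    return(data.frame(
      key = character(),
      next_word = character(),
      count = integer(),
      stringsAsFactors = FALSE
    ))
  }

  keys <- generate_ngram_keys(tokens, n)
  # The i-th key is followed by token i + n - 1.
  next_words <- tokens[n:length(tokens)]

  # Count only the (key, next_word) pairs that actually occur.
  # table(keys, next_words) would also create a zero-count row for every
  # unobserved combination, which grows very large on a full book.
  pairs <- data.frame(
    key = keys,
    next_word = next_words,
    count = 1L,
    stringsAsFactors = FALSE
  )
  tab <- aggregate(count ~ key + next_word, data = pairs, FUN = sum)

  return(tab)
}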

#### d) Function to digest the text.

In [13]:
digest_text <- function(text, n) {
  if (length(text) > 1) {
    text <- paste(text, collapse = " ")
  }
  tokens <- tokenize(text)
  ngram_tab <- build_ngram_table(tokens, n)
  return(ngram_tab)
}

#### e) Function to digest the url.

In [16]:
digest_url <- function(url, n) {
  lines <- readLines(url, warn = FALSE)
  text <- paste(lines, collapse = " ")
  ngram_tab <- digest_text(text, n)
  return(ngram_tab)
}

#### f) Function that gives random start.

In [19]:
random_start <- function(ngram_table) {
  if (nrow(ngram_table) == 0) {
    stop("The ngram table is empty. Nothing to sample from.")
  }
  start <- sample(unique(ngram_table$key), 1)
  
  return(start)
}

#### g) Function to predict the next word.

In [22]:
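predict_next_word <- function(current_key, ngram_table) {
  # Keep only the rows whose key matches the current context.
  rows <- ngram_table[ngram_table$key == current_key, ]
  if (nrow(rows) == 0) {
    # The key never appeared in the source text, so there is nothing to sample.
    return(NA_character_)
  }
  # Sample the next word in proportion to how often it followed the key.
  probs <- rows$count / sum(rows$count)
  next_word <- sample(rows$next_word, size = 1, prob = probs)
  return(next_word)
}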
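
As a quick check on the sampling step, the sketch below draws many next words for one key and compares the observed frequencies with the probabilities computed from the counts. The table rows and counts are invented for illustration and do not come from either book.

```r
# Hypothetical rows for a single key, shaped like the ngram table.
rows <- data.frame(
  key = c("the king", "the king", "the king"),
  next_word = c("said", "was", "had"),
  count = c(6L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Convert counts to probabilities that sum to 1.
probs <- rows$count / sum(rows$count)

# Draw many next words so the empirical frequencies settle near probs.
set.seed(2025)
draws <- sample(rows$next_word, size = 10000, replace = TRUE, prob = probs)

# Reorder the frequency table to match the row order before comparing.
freq <- table(draws)[rows$next_word] / length(draws)
print(round(cbind(expected = probs, observed = as.numeric(freq)), 3))
```

Words with larger counts are drawn more often, but less frequent words still appear, which is what keeps the generated text from repeating the single most common continuation.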

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [25]:
generate_text <- function(ngram_table, n, length = 20, start = NULL) {
  # Fall back to a random key when the user gives no start words.
  if (is.null(start)) {
    start <- random_start(ngram_table)
  }
  # Normalize the start words the same way as the training text, so that
  # input such as "The King" still matches the lowercase keys.
  words <- tokenize(start)

  for (i in seq_len(length)) {
    # A key needs n - 1 words; stop if the start was too short.
    if (length(words) < (n - 1)) {
      break
    }

    # The key is the last n - 1 words generated so far.
    key <- paste(tail(words, n - 1), collapse = " ")

    next_word <- predict_next_word(key, ngram_table)
    # Stop early when the key never appeared in the source text.
    if (is.na(next_word)) {
      break
    }
    words <- c(words, next_word)
  }
  return(paste(words, collapse = " "))
}
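
The sliding-key loop can be traced by hand on a tiny table. The sketch below uses an invented table in which every key has exactly one continuation, so the result does not depend on random sampling; it only illustrates how the key advances one word at a time and how generation stops when a key is missing.

```r
# Hypothetical trigram table with one continuation per key.
toy_table <- data.frame(
  key = c("the king", "king was", "was very"),
  next_word = c("was", "very", "old"),
  count = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

n <- 3
words <- c("the", "king")

for (step in seq_len(10)) {
  # The key is always the last n - 1 words produced so far.
  key <- paste(tail(words, n - 1), collapse = " ")
  candidates <- toy_table[toy_table$key == key, ]
  # "very old" is not a key in the table, so the loop ends there.
  if (nrow(candidates) == 0) break
  # Only one candidate exists per key here, so take it directly.
  words <- c(words, candidates$next_word[1])
}

result <- paste(words, collapse = " ")
print(result)
```

Although up to 10 words were requested, generation ends after three additions because the final key has no entry, the same early exit the function above takes when the prediction is NA.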

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [17]:
set.seed(2025)
grimm_trigrams <- digest_url("https://www.gutenberg.org/cache/epub/2591/pg2591.txt", n = 3)

text_the_king <- generate_text(grimm_trigrams, n = 3, length = 15, start = "the king")

text_random_start <- generate_text(grimm_trigrams, n = 3, length = 15)

text_the_king
text_random_start

#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [29]:
set.seed(2025)

armour_trigrams <- digest_url(
  "https://www.gutenberg.org/cache/epub/46342/pg46342.txt",
  n = 3
)

text_the_king_armour <- generate_text(
  armour_trigrams,
  n = 3,
  length = 15,
  start = "the king"
)

text_random_armour <- generate_text(
  armour_trigrams,
  n = 3,
  length = 15
)

text_the_king_armour
text_random_armour

#### c) Explain in 1-2 sentences the difference in content generated from each source.

The text generated from Grimm’s Fairy Tales sounds more like a story, with characters, actions, and narrative elements (“the king… lying there… wedding”). In contrast, the text from Ancient Armour and Weapons in Europe is much more technical and descriptive, referring to armor, weapons, figures, and historical terms, so the generated sentences feel more like historical documentation than storytelling.

## Question 3
#### a) What is a language learning model? 

A language model is basically a program that tries to understand patterns in text so it can predict what words are likely to come next. You give it a bunch of text, and it “learns” things like which words tend to appear together, how sentences usually flow, and what structure the language has. Once it’s trained, you can ask it to generate new text that follows those same patterns. In short, it’s like teaching a computer to “guess the next word” over and over until it can write sentences that sound like they came from the original source.
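
The "guess the next word" idea can be shown in a few lines. This is a minimal sketch on a made-up sentence, not the model from Question 1: it counts which words follow "the" and picks the most frequent one.

```r
# Made-up training text for illustration.
text <- "the cat sat on the mat and the cat slept"
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Pair each word with the word that follows it.
current <- head(tokens, -1)
following <- tail(tokens, -1)

# Count the words that come right after "the".
after_the <- table(following[current == "the"])
print(after_the)

# The most frequent follower is the model's best single guess.
best_guess <- names(after_the)[which.max(after_the)]
print(best_guess)
```

A model that samples in proportion to these counts, instead of always taking the top word, produces more varied text.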

#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

If I want to run a language model locally, the main idea is just to keep everything on my own machine instead of connecting to an online service. That usually means downloading whatever text I want to train on, building the n-gram table myself (like we did in this homework), and then generating text directly in R or Python. With the n-gram approach, everything is self-contained, no internet needed. For larger modern models you’d download the model files beforehand, install the right libraries, and run them offline, but the concept is the same: once you have the data and the code on your computer, you don’t have to rely on any external language model server.

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** | The shell reads my command (mkdir project) and runs it. |
| **Terminal emulator** | The terminal emulator is just the window where I type the command. |
| **Process** | A process is a running program; mkdir becomes a short-lived process when executed. |
| **Signal** | A signal is a way to interrupt or control a process (e.g., Ctrl+C), but it doesn’t really apply to mkdir. |
| **Standard input** | Standard input is data sent into a program; mkdir doesn’t read any input from stdin. |
| **Standard output** | Standard output is where a program prints its normal output; mkdir prints nothing on success, and its error messages go to standard error rather than standard output. |
| **Command line argument** | A command line argument is extra information after the command; in mkdir project, project is the argument. |
| **The environment** | The environment is the set of variables (like PATH) that each process inherits from its parent; the shell uses PATH to locate the mkdir executable. |

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?

find

xargs

grep

#### b) Explain what this command is doing, part by part.

find . -iname "*.R"
→ The find program searches the current directory (.) and all of its subdirectories for files whose names end in .R, ignoring case (so .r matches too).

|
→ The pipe connects find's standard output to xargs's standard input, so the list of matching file names is sent to the next program.

xargs grep read_csv
→ xargs takes the file names from find and passes them as arguments to grep.
→ grep read_csv searches inside those R files for the text read_csv.
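
For comparison, the same two steps can be approximated in R. This is only a rough analogue under stated assumptions: the directory and file names are invented, the files are created in a temporary folder so the contents are known, and the sketch reports matching file names only, whereas grep also prints the matching lines.

```r
# Work in a throwaway directory so the example has known contents.
demo_dir <- file.path(tempdir(), "find_grep_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines("df <- read_csv('data.csv')", file.path(demo_dir, "load.R"))
writeLines("plot(1:10)", file.path(demo_dir, "sub", "plot.r"))
writeLines("read_csv is mentioned here", file.path(demo_dir, "notes.txt"))

# Step 1 (like find . -iname "*.R"): recursive, case-insensitive name match.
r_files <- list.files(
  demo_dir,
  pattern = "\\.r$",
  ignore.case = TRUE,
  recursive = TRUE,
  full.names = TRUE
)

# Step 2 (like xargs grep read_csv): keep files with a line containing the text.
has_match <- vapply(
  r_files,
  function(f) any(grepl("read_csv", readLines(f, warn = FALSE), fixed = TRUE)),
  logical(1)
)
matching_files <- basename(r_files[has_match])
print(matching_files)
```

The text file is skipped even though it mentions read_csv, because the name filter in step 1 runs before any file contents are searched.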

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

#### b) Access RStudio through a Docker container. Set your password and make sure your files show up on the RStudio server. Type the command and the output you get below.

C:\Users\15633\Desktop\BIOS_512\assignment 10>docker run -d -p 8787:8787 -e PASSWORD=199831698Frankff -v "%cd%":/home/rstudio rocker/rstudio
Unable to find image 'rocker/rstudio:latest' locally
latest: Pulling from rocker/rstudio
3665120d345d: Pull complete
5d246ec925db: Pull complete
3c7cdccc4be7: Pull complete
890065c4c99d: Pull complete
08e74fd5985d: Pull complete
999e4b8f7ed8: Pull complete
e4b9e87bb831: Pull complete
2c9ba66d5dbe: Pull complete
62f215ca34c6: Pull complete
d923cf803a12: Pull complete
4b3ffd8ccb52: Pull complete
2a63ed8b2250: Pull complete
9c1a4a0706b7: Pull complete
b71e78fefbbb: Pull complete
39038e16d1ba: Pull complete
971ba7cf0d8a: Pull complete
191985778909: Pull complete
664fb1818bbb: Pull complete
Digest: sha256:9f85211a666fb426081a6f5a01f9f9f51655262258419fa21e0ce38a5afc78d8
Status: Downloaded newer image for rocker/rstudio:latest
ca9bb4739fc47f6848d5e746076547c0cae5d6383ac88bb8852e18ea086e18e8

#### c) How do you log in to the RStudio server?

To log in to the RStudio server, I open a browser and go to http://localhost:8787. The username is rstudio, and the password is whatever I set in the Docker command (in my case, 199831698Frankff).