In [None]:
#Question 1
Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

In [1]:
#a Make a function to tokenize the text
library(tokenizers)

tokenize_text <- function(text) {
  tokens <- tokenizers::tokenize_words(
    text,
    lowercase = TRUE,
    strip_punct = TRUE
  )[[1]]  
  
  return(tokens)
}

#b Make a function to generate keys for n-grams
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse = sep)
}


#c Function to build an ngram table
build_ngram_table <- function(tokens, n, sep = "\x1f") {
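  n <- as.integer(n)
  # With n < 2 there is no context, and i:(i + n - 2L) would count backwards
  if (n < 2L) {
    stop("n must be at least 2.")
  }
  # Not enough tokens to form a single n-gram: return an empty table
  if (length(tokens) < n) {
    return(new.env(parent = emptyenv()))
  }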
  
  tbl <- new.env(parent = emptyenv())
  

  for (i in seq_len(length(tokens) - n + 1L)) {
    ngram <- tokens[i:(i + n - 2L)]
    next_word <- tokens[i + n - 1L]
    key <- key_from(ngram, sep = sep)
    counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (next_word %in% names(counts)) {
      counts[[next_word]] <- counts[[next_word]] + 1L
    } else {
      counts[[next_word]] <- 1L
    }
    tbl[[key]] <- counts
  }
  
  tbl
}


#d Function to digest the text
digest_text <- function(text, n, sep = "\x1f") {
  tokens <- tokenize_text(text)
  build_ngram_table(tokens, n = n, sep = sep)
}


#e Function to digest the url
digest_url <- function(url, n, sep = "\x1f") {
  res <- httr::GET(url)
  txt <- httr::content(res, as = "text", encoding = "UTF-8")
  digest_text(txt, n = n, sep = sep)
}


#f Function that gives random start
random_start <- function(tbl, sep = "\x1f") {
  keys <- ls(envir = tbl, all.names = TRUE)
  if (length(keys) == 0L) {
    stop("No n-grams available. Digest text first.")
  }
  
  picked <- sample(keys, 1)
  strsplit(picked, sep, fixed = TRUE)[[1]]
}

#g Function to predict next word
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
  key <- key_from(ngram, sep = sep)
  counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
  
  if (length(counts) == 0L) {
    return(NA_character_)
  }
  
  sample(
    names(counts),
    size = 1,
    prob = as.numeric(counts)
  )
}

#h Function that puts everything together. If the user does not supply exactly n - 1 start words, a random (n - 1)-word start is drawn from the table.
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
  force(tbl)
  n <- as.integer(n)
  force(sep)
  
  function(start_words = NULL, length = 10L) {
    if (is.null(start_words) || length(start_words) != (n - 1L)) {
      start_words <- random_start(tbl, sep = sep)
    }
    
    word_sequence <- start_words
    for (i in seq_len(max(0L, length - length(word_sequence)))) {
      ngram <- tail(word_sequence, n - 1L)
      next_word <- predict_next_word(tbl, ngram, sep = sep)
      
      if (is.na(next_word)) break
      
      word_sequence <- c(word_sequence, next_word)
    }
    
    paste(word_sequence, collapse = " ")
  }
}
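
To make the table structure concrete, here is a minimal sketch in base R of what the n-gram table stores and how the next word is sampled. The token vector is a made-up toy example, not taken from the assignment text, and the sketch uses `split()` and `table()` instead of the environment-based table above.

```r
# Toy illustration (hypothetical tokens, base R only): count which word
# follows each two-word context, then sample a next word by frequency.
tokens <- c("the", "king", "said", "the", "king", "went", "the", "king", "said")
n <- 3L

# Each n-gram starts at position i; its first n - 1 words form the context.
starts <- seq_len(length(tokens) - n + 1L)
contexts <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 2L)], collapse = " "),
  character(1)
)
next_words <- tokens[starts + n - 1L]

# Group the following words by context and tabulate them.
counts_by_context <- lapply(split(next_words, contexts), table)
king_counts <- counts_by_context[["the king"]]
print(king_counts)

# Sample one next word with probability proportional to its count.
set.seed(1)
picked <- sample(names(king_counts), size = 1, prob = as.numeric(king_counts))
print(picked)
```

In this toy input, "the king" is followed by "said" twice and "went" once, so "said" is drawn about two thirds of the time.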


In [2]:
#Question 2
set.seed(2025)

#a Test your model using text file of Grimm's Fairy Tales.
grimms_url <- "https://www.gutenberg.org/files/5314/5314-0.txt"


n <- 3

# Digest Grimm's Fairy Tales into an n-gram table
tbl3_grimm <- digest_url(grimms_url, n = n)

# Make the generator
gen3_grimm <- make_ngram_generator(tbl3_grimm, n = n)



# i) Start words "the king", length 15
output_i <- gen3_grimm(start_words = c("the", "king"), length = 15)
cat("i) n=3, start='the king', length=15:\n", output_i, "\n\n")

# ii) Random start, length 15
output_ii <- gen3_grimm(length = 15)
cat("ii) n=3, random start, length=15:\n", output_ii, "\n")

#Explain the difference in content generated by each source.
#When start words such as "the king" are specified, the model begins generating from that context, so each next word is sampled from
#the words that followed "the king" in the text. When no start words are given, the model picks a random (n - 1)-word key from the table,
#so the generated text begins at an arbitrary point in the stories and its content varies from run to run.


ERROR: Error in digest_url(grimms_url, n = n): could not find function "digest_url"


In [None]:
#Question 3

#a What is a large language model (LLM)?
#An LLM is a type of machine learning model that assigns probabilities to word sequences and can generate human language.
#Given the preceding input (the context), an LLM computes a probability distribution over possible next words.
#A simpler relative is the Markov model, which predicts the next state from the current state alone rather than from the full history of inputs.
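
The Markov property mentioned above can be sketched with a small simulation. The two states and the transition probabilities below are invented for illustration only; the point is that each draw looks at the current state and nothing earlier.

```r
# Hypothetical two-state chain: each row holds the probabilities of the
# next state given the current state (each row sums to 1).
states <- c("sunny", "rainy")
transition <- matrix(
  c(0.8, 0.2,
    0.4, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(states, states)
)

simulate_chain <- function(start, steps) {
  path <- character(steps + 1L)
  path[1] <- start
  for (i in seq_len(steps)) {
    # Only the current state, path[i], is used to draw the next one.
    path[i + 1L] <- sample(states, size = 1, prob = transition[path[i], ])
  }
  path
}

set.seed(42)
chain <- simulate_chain("sunny", steps = 10)
print(chain)
```

An n-gram model is the same idea with the last n - 1 words playing the role of the current state.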

#b How do you run an LLM locally?
#You can run a large language model locally using a tool like Ollama. Ollama can be installed with Homebrew from the terminal; once it is running,
#it lets you talk to the language model over HTTP.
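
As a rough sketch of the HTTP step, the request below targets Ollama's generate endpoint from R using httr. It assumes Ollama is serving on its default port (11434) and that a model has already been pulled; the model name `llama3.2` and the helper `ask_ollama` are placeholders, not part of the assignment.

```r
# Assumptions: Ollama is running locally on its default port and a model
# named "llama3.2" (placeholder) is available.
ollama_url <- "http://localhost:11434/api/generate"

request_body <- list(
  model = "llama3.2",          # placeholder model name
  prompt = "Once upon a time",
  stream = FALSE               # ask for one JSON reply instead of a stream
)

# Hypothetical helper: sends the prompt and returns the generated text.
ask_ollama <- function(body, url = ollama_url) {
  # httr encodes the list as JSON and sends it as the POST body.
  res <- httr::POST(url, body = body, encode = "json")
  httr::stop_for_status(res)
  # The generated text comes back in the "response" field.
  httr::content(res, as = "parsed")$response
}

# Nothing is sent until the helper is called, for example:
# ask_ollama(request_body)
```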

In [None]:
#Question 4
Explain what the following vocab words mean in the context of typing mkdir project into the command line. If the term doesn't apply to this command, give the definition and/or an example.

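Term	Meaning
Shell	The program (for example bash or zsh) that reads the line mkdir project, parses it, and launches mkdir
Terminal emulator	The window application you type mkdir project into; it passes your keystrokes to the shell and displays the output
Process	A running instance of a program; the shell starts a new process to run mkdir
Signal	A message sent to a process to tell it to do something, for example Ctrl+C sends SIGINT to interrupt it; mkdir usually finishes before any signal is needed
Standard input	(stdin) The stream a process reads its input from; mkdir does not read from stdin because it receives the directory name as an argument
Standard output	(stdout) The stream a process writes its normal output to; mkdir prints nothing on success, and error messages go to standard error instead
Command line argument	"project" is the argument passed to the mkdir program
The environment	The set of environment variables (such as PATH and HOME) that the process inherits; the shell uses PATH to locate the mkdir program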
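
Several of these terms can be observed from R itself. The sketch below launches mkdir as a child process with `system2()`, passing the directory as a command line argument and capturing standard output. It assumes a Unix-like system with mkdir on the PATH and writes into a temporary directory rather than the working directory.

```r
# Work in a throwaway location so nothing is created in the current directory.
target <- file.path(tempdir(), "project")
unlink(target, recursive = TRUE)

# "mkdir" is the program; the quoted path is its command line argument.
# stdout = TRUE captures whatever the process writes to standard output.
out <- system2("mkdir", args = shQuote(target), stdout = TRUE)

print(out)                            # empty: mkdir is silent on success
print(dir.exists(target))             # the directory now exists
print(nchar(Sys.getenv("PATH")) > 0)  # PATH is part of the environment
```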

In [None]:
#Question 5
#Consider the following command find . -name "*.R" | xargs grep read_csv
#a What are the programs?
#find, xargs, and grep

#b Explain what it's doing, step by step. 
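#1. find starts in the current directory, searches it recursively, and outputs the paths of files whose names end in .R
#2. The pipe (|) sends that list to the standard input of xargs
#3. xargs takes the file names and appends them as arguments to grep
#4. grep reads those files looking for lines that contain read_csv
#5. grep prints the matching lines (prefixed with the file name when more than one file is searched)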
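
The same steps can be sketched in R as a rough equivalent of the pipeline. The directory and the two script files are created here purely for illustration, and the output format only approximates what grep prints.

```r
# Set up a throwaway directory with two hypothetical scripts.
demo_dir <- file.path(tempdir(), "find_grep_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  c("library(readr)", "df <- read_csv('data.csv')"),
  file.path(demo_dir, "load.R")
)
writeLines("x <- 1 + 1", file.path(demo_dir, "math.R"))

# Step 1 (find . -name "*.R"): list files whose names end in .R, recursively.
r_files <- list.files(
  demo_dir, pattern = "\\.R$", recursive = TRUE, full.names = TRUE
)

# Steps 2-5 (| xargs grep read_csv): scan each file and keep matching lines.
matches <- unlist(lapply(r_files, function(path) {
  lines <- readLines(path)
  hits <- grep("read_csv", lines, fixed = TRUE, value = TRUE)
  if (length(hits) > 0L) paste0(basename(path), ":", hits) else character(0)
}))
print(matches)
```

Only load.R contains read_csv, so it is the only file that contributes a line to the result.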

In [None]:
#Question 6
#Install docker 

#a Show the response when you run docker (output of docker run hello-world)
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
198f93fd5094: Pull complete 
Digest: sha256:f7931603f70e13dbd844253370742c4fc4202d290c80442b2e68706d8f33ce26
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

#Access RStudio through Docker
#How do you log in to the RStudio server?
http://localhost:8787
username: rstudio
password: rstudio
