In [None]:
# Check GPU
!nvidia-smi

Description:
This cell contains a command that checks the GPU status of the system. The ! at the beginning of the command indicates that this is a shell command being executed from within the notebook. Specifically, nvidia-smi is a command-line utility that provides detailed information about the NVIDIA GPU(s) on the system, including utilization, temperature, memory usage, and more. This command is often used to verify that the system has an NVIDIA GPU and to monitor its status.
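The same check can be done from Python, which is handy when a later cell should behave differently on a machine without a GPU. The following is a minimal sketch, not part of the original notebook; the helper name gpu_status is hypothetical, and it only assumes that nvidia-smi is found on PATH when the NVIDIA driver utilities are installed.

```python
import shutil
import subprocess


def gpu_status():
    """Return the text printed by nvidia-smi, or None if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver utilities on PATH
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=False)
    return result.stdout


status = gpu_status()
print(status if status is not None else "nvidia-smi not found; no NVIDIA GPU driver detected")
```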

Let's move on to the content of the second cell.

In [None]:
# check if CUDA is installed
!nvcc --version

Description:
This cell checks if CUDA (Compute Unified Device Architecture) is installed on the system. CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose computing.

The ! symbol again indicates that a shell command is being executed from within the notebook. The command nvcc --version checks the version of the NVIDIA CUDA Compiler (nvcc). By executing this command, the user can verify that CUDA is installed and determine its version.
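If the version number is needed programmatically, it can be pulled out of the command's output with a regular expression. This is a sketch under assumptions: the helper name parse_nvcc_release is hypothetical, and the sample line only illustrates the usual "release X.Y" wording, so the actual numbers depend on the installed toolkit.

```python
import re


def parse_nvcc_release(output):
    """Extract (major, minor) from the text printed by `nvcc --version`, or None."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Illustrative sample line; the exact numbers depend on the installed toolkit.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))
```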

Let's proceed to the content of the third cell.

In [None]:
# update ubuntu packages
# This part is optional, but in my opinion necessary, as out-of-date packages can cause hard-to-explain errors later
!sudo apt update && sudo apt upgrade -y

Description:
This cell contains commands to update the Ubuntu packages on the system.

The sudo apt update command retrieves the list of available packages and their versions from the repositories. It refreshes the package index so that apt knows which packages have newer versions available.
The && operator chains the two commands: the second one runs only if the first one succeeds.
The sudo apt upgrade command then upgrades the installed packages to their latest versions. The -y flag automatically answers 'yes' to any prompts, so the upgrade proceeds without user intervention; passing the flag once is sufficient.
The comments in the cell indicate that while this step is optional, it's recommended to ensure that the system is up-to-date, reducing the risk of potential issues caused by outdated packages.
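The short-circuit behaviour of && described above can be imitated in Python. This is only an illustration of the semantics: run_chain is a hypothetical helper, and the two stand-in commands are small Python one-liners rather than apt, so the sketch can run anywhere.

```python
import subprocess
import sys


def run_chain(commands):
    """Run commands in order, stopping at the first failure (like `a && b`).

    Returns the number of commands that were started.
    """
    started = 0
    for command in commands:
        started += 1
        if subprocess.run(command).returncode != 0:
            break  # a non-zero exit status short-circuits the chain
    return started


# Stand-ins for real commands: one that fails and one that succeeds.
fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
ok = [sys.executable, "-c", "pass"]

print(run_chain([ok, ok]))    # both commands run
print(run_chain([fail, ok]))  # the second command is skipped
```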

Let's examine the content of the fourth cell.

In [None]:
# paperspace already has the 'huggingface_hub' library installed. If using notebook outside of paperspace, install library with ->
# -> !pip install huggingface_hub
from huggingface_hub import snapshot_download

token = "Delete this text and enter your token between the quotation marks" # Get a Read token from your Hugging Face account

# creates a folder called "Llama-2-7b" in the notebook's root (working) directory and places the model files inside it. The download is about 13GB
path = snapshot_download(repo_id="meta-llama/Llama-2-7b-chat", cache_dir="Llama-2-7b", use_auth_token=token) # we're downloading the chat version of the 7b model

Description:
This cell is focused on downloading a model from the Hugging Face Hub. The specific steps and annotations are as follows:

The comments at the beginning mention that the huggingface_hub library is already installed in the "paperspace" environment. If someone is running this notebook outside of "paperspace", they should install the huggingface_hub library using pip.
The library huggingface_hub is imported, specifically the snapshot_download function.
A variable token is initialized with a placeholder value. This token is intended to be a "Read Token" from the user's Hugging Face account. The user is instructed to replace the placeholder text with their actual token.
The snapshot_download function is called to download the "chat" version of the Llama-2-7b model from the repository "meta-llama/Llama-2-7b-chat". Because cache_dir is given as a relative path, the files are placed in a folder called "Llama-2-7b" under the notebook's working directory (which the notebook's comments call the root directory). The download is approximately 13GB.
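Pasting a token into the notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads it from an environment variable instead. The helper name get_hf_token and the choice of the variable name HF_TOKEN are assumptions made for this example, not something the original notebook does.

```python
import os


def get_hf_token(env=None):
    """Read the Hugging Face token from the environment instead of the notebook source."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()  # HF_TOKEN is an assumed variable name
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable to your Read token")
    return token


# Usage in the notebook (not executed here):
# token = get_hf_token()
```

The returned value can then be passed to snapshot_download in place of the hard-coded string.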
Now, let's move on to the content of the fifth cell.

In [None]:
# now we clone the github repository for running llama2
# huggingface also has libraries to run llama2 but I use the following library for its simplicity and ease of use
!git clone https://github.com/ggerganov/llama.cpp.git

Description:
This cell is focused on cloning a GitHub repository to the current environment.

The initial comments clarify the purpose of this cell. While Hugging Face provides libraries to run the "llama2" model, the author of the notebook prefers using a different library from the GitHub repository due to its simplicity.
The !git clone command is a shell command executed within the notebook that clones the specified GitHub repository. In this case, the repository "https://github.com/ggerganov/llama.cpp.git" contains code for running the "llama2" model.
Next, we'll examine the content of the sixth cell.

In [None]:
# we have to move from the root directory to the above repository we just downloaded
%cd llama.cpp

# we have to install the requirements from the repository's 'requirements.txt' file
!pip install -r requirements.txt

Description:
This cell accomplishes two main tasks related to the previously cloned GitHub repository:

The %cd magic command is used to change the current working directory of the notebook to the "llama.cpp" directory, which was created when the repository was cloned in the previous cell.
The !pip install -r requirements.txt command installs the Python packages listed in the "requirements.txt" file of the "llama.cpp" repository. This ensures that all the necessary dependencies for running the code from that repository are available in the current environment.
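The reason the cell uses %cd rather than !cd is that a ! command runs in a separate shell process, so a directory change made there does not carry over to the notebook. The sketch below reproduces that difference in plain Python, using subprocess as a stand-in for ! and os.chdir as a stand-in for %cd; the temporary directory is only there for the demonstration.

```python
import os
import subprocess
import tempfile

start = os.getcwd()
with tempfile.TemporaryDirectory() as scratch:
    # Like `!cd scratch`: the directory change happens in a child shell and is lost.
    subprocess.run(f'cd "{scratch}"', shell=True, check=True)
    after_shell_cd = os.getcwd()

    # Like `%cd scratch`: the notebook process itself changes directory.
    os.chdir(scratch)
    after_chdir = os.getcwd()

    os.chdir(start)  # restore before the temporary directory is removed

print(after_shell_cd == start)  # True
print(after_chdir == start)     # False
```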

In [None]:
# We restart the kernel to make use of the libraries we just installed
# A notice will appear saying that the kernel died; just dismiss whatever pop-up appears and continue to the next block
import os
os.kill(os.getpid(), 9)

Description:
This cell forcefully restarts the kernel. Here's a breakdown of its content:

The comments explain that the purpose of this cell is to restart the kernel. This is often done to ensure that any newly installed libraries or changes to the environment are recognized and can be used in subsequent cells.
The user is informed that they might see a notice about the kernel dying. They are advised to ignore this notice and proceed with the notebook.
The os.kill(os.getpid(), 9) command sends a SIGKILL signal to the current process (i.e., the kernel), which causes it to terminate immediately.
It's worth noting that when running this cell in an interactive notebook environment, the user may need to manually restart the kernel or continue with the subsequent cells after the kernel has restarted.
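The number 9 in the cell is the numeric value of SIGKILL on Linux. The sketch below shows the same signal spelled by name and applied to a throwaway child process instead of the kernel, so it can be run safely. It assumes a POSIX system (signal.SIGKILL does not exist on Windows).

```python
import signal
import subprocess
import sys

# Start a throwaway child process that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Same signal as os.kill(pid, 9), but spelled with its name.
child.send_signal(signal.SIGKILL)
child.wait(timeout=10)

# On POSIX systems a negative return code means "terminated by that signal".
print(child.returncode == -signal.SIGKILL)
```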

Moving on, let's inspect the content of the eighth cell.

In [None]:
# change directory to llama.cpp again since we restarted kernel
%cd llama.cpp

Description:
After restarting the kernel, the current working directory is reset to its default value. This cell changes the working directory back to the "llama.cpp" directory (the repository that was cloned earlier). This is accomplished using the %cd magic command.

Let's continue with the content of the ninth cell.

In [None]:
# now we have to build the library we just downloaded. if we don't set 'LLAMA_CUBLAS=1' the build will use the cpu instead. ->
# -> This will make the model's output much slower to generate

# This is the reason we also checked at the beginning of the notebook if we had cuda installed. 
# cuda is necessary to make use of the GPU
!make LLAMA_CUBLAS=1

Description:
This cell compiles and builds the "llama.cpp" library to make use of GPU acceleration:

The comments clarify that the library should be built with GPU support to speed up the model's output generation.
The !make LLAMA_CUBLAS=1 command builds the library. The LLAMA_CUBLAS=1 parameter specifies that the CUDA Basic Linear Algebra Subprograms (cuBLAS) library should be used, allowing the model to run on the GPU instead of the CPU. This is why the presence of CUDA was checked earlier in the notebook.
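The link between the earlier CUDA check and this build step can be made explicit by choosing the make arguments in code. This is a sketch only: build_command is a hypothetical helper, and treating "nvcc is on PATH" as proof of a working CUDA toolkit is a simplifying assumption.

```python
import shutil


def build_command(cuda_available):
    """Return the make invocation used in this notebook, with or without cuBLAS."""
    command = ["make"]
    if cuda_available:
        command.append("LLAMA_CUBLAS=1")  # flag taken from the cell above
    return command


# Treat the presence of nvcc on PATH as a rough proxy for a usable CUDA toolkit.
has_nvcc = shutil.which("nvcc") is not None
print(" ".join(build_command(has_nvcc)))
```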

In [None]:
# IMPORTANT ->
# The above command built a few files based on the machine we are using. If you start a new paperspace instance with a different GPU, ->
# this guide will no longer work. A fix is to first run '!make clean' followed by '!make LLAMA_CUBLAS=1'

Description:
This cell provides important information and guidance regarding the previously executed build command:

The comments indicate that the build command from the previous cell generated files specific to the current machine and GPU setup.
If a user starts a new instance with a different GPU configuration, the guide may not work as expected.
A solution is provided: If users encounter issues on a different setup, they should first run !make clean to clean up the generated files and then re-run the !make LLAMA_CUBLAS=1 command to rebuild the library.

In [None]:
# we are now going to convert the model, as we cannot use it in the format we downloaded it in. Usually, the format that it's in is a .chk file format
# The repository we are using converts it into the .gguf format.
# you can check the file format of the model we downloaded by running the following code:
!ls ../Llama-2-7b/models--meta-llama--Llama-2-7b-chat/snapshots/2abbae1937452ebd4eecb63113a87feacd6f13ac

Description:
This cell provides information about the model's format and lists the files in the model's directory:

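The comments describe the downloaded model as being in a .chk file format, and explain that the "llama.cpp" repository will be used to convert it into the .gguf format for compatibility. In Meta's original-format repositories the weights are typically stored as PyTorch .pth checkpoint files, with the .chk files being checksum lists, so the listing produced by this cell is worth reading to confirm what was actually downloaded.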
The !ls command lists the files in the specified directory, which is where the model was downloaded. This allows the user to inspect the model's current file format.
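The long hexadecimal folder name in the path identifies one specific snapshot, so it may differ if the repository is updated. As a sketch, the snapshot folder can be located by searching the cache directory instead of hard-coding it. The helper name find_snapshots is hypothetical, the folder layout is inferred from the path used in the cell above, and "abc123" is a made-up snapshot name used only for the demonstration.

```python
from pathlib import Path
import tempfile


def find_snapshots(cache_dir, repo_id):
    """List snapshot folders for repo_id inside a huggingface_hub cache directory."""
    repo_folder = "models--" + repo_id.replace("/", "--")
    snapshots = Path(cache_dir) / repo_folder / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


# Demonstration on a throwaway directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as cache:
    fake = Path(cache) / "models--meta-llama--Llama-2-7b-chat" / "snapshots" / "abc123"
    fake.mkdir(parents=True)
    found = find_snapshots(cache, "meta-llama/Llama-2-7b-chat")
    print([p.name for p in found])
```

In the notebook itself, the call would be find_snapshots("../Llama-2-7b", "meta-llama/Llama-2-7b-chat") when run from inside the llama.cpp folder.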

In [None]:
# convert .chk file to .gguf
!python3 convert.py ../Llama-2-7b/models--meta-llama--Llama-2-7b-chat/snapshots/2abbae1937452ebd4eecb63113a87feacd6f13ac

Description:
This cell converts the model from the .chk file format to the .gguf format:

The convert.py script from the "llama.cpp" repository is invoked to perform the conversion.
The path provided as an argument to the convert.py script points to the location of the model in the .chk format. The script will then convert this model to the desired .gguf format.

In [None]:
# The code made a new file called 'ggml-model-f16.gguf'; this is the converted llama2 model in .gguf format

# now we need to quantize the model. This part is necessary if we are using a weaker machine. 
# we can technically run it as is but it'll generate text a lot slower than if it were quantized

Description:
This cell provides explanatory comments about the next steps in the process:

The first comment informs the user that the previous cell created a new file named ggml-model-f16.gguf, which is the converted "llama2" model in the .gguf format.
The subsequent comments explain the need for model quantization. Quantization stores the weights at lower precision, which shrinks the model and lets it run more efficiently, especially on weaker hardware. The model can run without quantization, but text generation is slower than with a quantized model.
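A rough size estimate shows why quantization matters. The arithmetic below is a back-of-the-envelope sketch, not a measurement: it assumes about 7 billion weights for the "7b" model, 16 bits per weight for the f16 file, and about 4.5 bits per weight for a 4-bit block format (32 four-bit values plus one 16-bit scale per block of 32 weights).

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: "7b" taken as roughly 7 billion weights

# f16 stores each weight in 16 bits.
print(round(model_size_gb(N_PARAMS, 16), 2))

# A 4-bit block format is assumed here to cost about 4.5 bits per weight:
# 32 four-bit values plus one 16-bit scale per block of 32 weights.
print(round(model_size_gb(N_PARAMS, 4.5), 2))
```

Under these assumptions the f16 file is on the order of 14 GB, which is in line with the roughly 13GB download mentioned earlier, while the 4-bit version is around 4 GB.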
Let's inspect the content of the fourteenth cell.

In [None]:
# You can get a test run with the un-quantized model with this code block, 
# it'll generate a few random words so you can get a sense of its speed along with other model settings
!./main -m ../Llama-2-7b/models--meta-llama--Llama-2-7b-chat/snapshots/2abbae1937452ebd4eecb63113a87feacd6f13ac/ggml-model-f16.gguf -n 128

# As you can see, it's really slow to generate text.

Description:
This cell tests the speed of the un-quantized model:

The initial comments explain that this code block allows the user to test the speed of the un-quantized model by generating a few random words.
The !./main command runs the main executable of the "llama.cpp" repository with the specified model (ggml-model-f16.gguf) and a setting to generate 128 tokens (-n 128).
The concluding comment notes that the un-quantized model is slow in generating text, emphasizing the potential benefits of quantization.
Let's proceed to the content of the fifteenth cell.

In [None]:
# let's move the model so we can quantize it and afterwards chat with it
!mv ../Llama-2-7b/models--meta-llama--Llama-2-7b-chat/snapshots/2abbae1937452ebd4eecb63113a87feacd6f13ac/ggml-model-f16.gguf models/

Description:
This cell moves the model file to a different directory in preparation for the quantization and subsequent usage:

The comment provides context, indicating that the model will be moved to a new location for the next steps: quantization and chatting.
The !mv command is a shell command that moves the specified model file (ggml-model-f16.gguf) to the "models" directory.
Now, let's examine the content of the sixteenth cell.

In [None]:
# quantize the model
# the following code seems to be bugged. it's probably because it's meant for the terminal, not for a notebook. It'll run but it won't output the quantized model to the directory
# copy and paste the below code into the terminal. Make sure your current directory is inside the llama.cpp folder before running
!./quantize models/ggml-model-f16.gguf ggml-model-q4_0.gguf q4_0

Description:
This cell attempts to quantize the model:

The initial comments explain that there might be an issue with the quantization command when executed from the notebook, as it might have been designed to run in a terminal environment. Although the command will execute, it might not produce the expected output.
The user is advised to copy the quantization command and execute it in a terminal, ensuring they are inside the "llama.cpp" directory.
The !./quantize command is intended to quantize the ggml-model-f16.gguf model and produce a new quantized model named ggml-model-q4_0.gguf with the quantization scheme q4_0.

In [None]:
# linux was giving me some issues (like always) so to fix them I saved the model to the llama.cpp root directory. let's move the quantized model to the models directory
!mv ggml-model-q4_0.gguf models/

Description:
This cell is focused on moving the quantized model to the appropriate directory:

The comment mentions that due to issues encountered in Linux, the model was saved in the root directory of "llama.cpp".
The !mv command then moves the quantized model, ggml-model-q4_0.gguf, from the root directory of "llama.cpp" to the "models" subdirectory.
Now, let's review the content of the eighteenth cell.

In [None]:
# now that the model is reformatted and quantized, we can finally start using it
# a premade chat prompt is already included in the llama.cpp repository so we'll use it
# the examples require us to place the model inside the models folder (like we already have) but they also require it to be inside another folder called 'llama-7b'. 
# we'll create the folder, move the model inside it and run the model with the example

!mkdir models/llama-7b
!mv models/ggml-model-q4_0.gguf models/llama-7b/

Description:
This cell prepares the environment to use the quantized model by moving it to a specific directory structure required by the examples:

The initial comments explain that now that the model is reformatted and quantized, it's ready for use. The "llama.cpp" repository includes a premade chat prompt that the notebook intends to utilize.
The chat examples in the "llama.cpp" repository expect the model to be placed inside a folder named "llama-7b" within the "models" directory.
The !mkdir command creates the "llama-7b" folder inside the "models" directory.
The !mv command moves the quantized model ggml-model-q4_0.gguf into the newly created "llama-7b" directory.
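Running the mkdir cell a second time fails because the folder already exists. A Python version of the same two steps can be made safe to re-run, as sketched below. The helper name place_model is hypothetical, and the demonstration uses an empty placeholder file in a temporary directory rather than the real model.

```python
from pathlib import Path
import shutil
import tempfile


def place_model(model_file, target_dir):
    """Create target_dir if needed and move model_file into it; return the new path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    destination = target_dir / Path(model_file).name
    shutil.move(str(model_file), str(destination))
    return destination


# Demonstration with an empty placeholder file in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    models = Path(root) / "models"
    models.mkdir()
    placeholder = models / "ggml-model-q4_0.gguf"
    placeholder.touch()
    moved = place_model(placeholder, models / "llama-7b")
    moved_ok = moved.exists() and not placeholder.exists()
    print(moved.relative_to(root))
```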

In [None]:
# now just copy the below code and paste it into the terminal to chat with llama2-7b
./examples/chat.sh


# the example names the model bob through a prompt, but you can use other code to run your own prompt, for example: 
./main -m models/llama-7b/ggml-model-q4_0.gguf -n -1 --color -r "User:" --in-prefix " " -i -p \
'User: Hi
AI: Hello. I am an AI chatbot. Would you like to talk?
User: Sure!
AI: What would you like to talk about?
User:'

# copy and paste the above code in the terminal, make sure you're in the llama.cpp folder when running

# Unfortunately, you can't chat with the model in the notebook. if you try running it, it'll run and stay on, but since the model requires input and this notebook doesn't have a way to
# send input to the terminal while it runs, it's not possible.

Description:
This cell provides commands to chat with the "llama2-7b" model:

The initial comment instructs users to copy the ./examples/chat.sh command and execute it in a terminal to start a chat session with the model.
A note mentions that the provided example names the model as "bob" in the chat prompt. However, if users want to customize the chat prompt, they can use an alternative command that specifies a different conversation format.
The ./main command provided is a more detailed way to start a chat session, specifying model parameters and an initial chat prompt. The given conversation format sets up an initial greeting between the user and the AI model.
A cautionary note clarifies that the model cannot be run directly in the notebook. This is because the model requires real-time user input, which can't be provided within the notebook's interface. Thus, users are encouraged to run the model in a terminal.
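Interactive chat needs a terminal, but a single non-interactive generation does not, as the earlier test-run cell with -n 128 already shows. The sketch below builds such a one-shot command from the -m, -n and -p flags that appear in this notebook. The helper name one_shot_command and the sample prompt are made up for the example, and the command is only printed here rather than executed.

```python
import subprocess


def one_shot_command(model_path, prompt, n_tokens=128):
    """Build a ./main invocation that generates n_tokens from prompt and then exits."""
    return ["./main", "-m", model_path, "-n", str(n_tokens), "-p", prompt]


command = one_shot_command("models/llama-7b/ggml-model-q4_0.gguf", "User: Hi\nAI:")
print(command)

# To actually run it from the llama.cpp folder (not executed here):
# result = subprocess.run(command, capture_output=True, text=True)
# print(result.stdout)
```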
Let's move on to the content of the twentieth (and final) cell.

In [None]:
# you can run the model in the terminal and then run this block to see how much GPU memory the model uses. 
# according to the info, it takes about 370MB of GPU memory to run the model
!nvidia-smi

Description:
This cell allows users to monitor GPU usage while the model is running:

The comments suggest that after running the model in a terminal, users can execute this cell to observe the GPU resource utilization.
The provided information mentions that, based on previous observations, the model takes up about 370MB of the GPU memory when running.
The !nvidia-smi command, which we've seen earlier in the notebook, provides detailed statistics about the GPU's status, including memory usage, temperature, and more.
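To record the memory figure instead of reading it off the table, nvidia-smi can print just the numbers in CSV form, which are then easy to parse. This is a sketch: parse_memory_csv is a hypothetical helper, and the sample line uses made-up values in the shape that the query shown in the comment is expected to produce.

```python
def parse_memory_csv(line):
    """Parse one 'used, total' line (in MiB) into a pair of integers."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


# Illustrative line in the shape produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
sample = "370, 16384"  # hypothetical values
used, total = parse_memory_csv(sample)
print(f"{used} MiB of {total} MiB in use ({used / total:.1%})")
```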