# <font color="#76b900">**1:** Getting Started With Large Language Models</font>

**Welcome To The Course!** This is the first content notebook and is intended to springboard you into the LLM loading workflow with some insights about our problem, our resources, and our objectives!

#### **Learning Objectives:**

- Review some basic assumptions about deep learning and show how they extend to language modeling.
- Pull your first LLM into the environment, investigate its architecture, and see how it performs!

-------

In [0]:
import keras

2025-04-02 12:08:54.799518: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-04-02 12:08:54.849105: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-02 12:08:54.849166: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-02 12:08:54.850639: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-02 12:08:54.858602: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-04-02 12:08:54.859547: I tensorflow/core/platform/cpu_feature_guard.cc:1

In [0]:
keras.__version__

'2.15.0'

## 1.1. Recalling Deep Learning

Throughout your learning adventure with deep learning, you have probably optimized a variety of models for tasks like classification and regression. You probably progressed in roughly the following order:

- When you started out, you used **linear and logistic regression** to model and interpret simple linear relationships that associated your inputs with your outputs.
- When that wasn't enough, you started **stacking linear layers one after another and adding non-linear activations** to give your model more predictive power.
- When your data started getting intractably high-dimensional, you started using more **informed sparsely-connected techniques like convolution** to add more control to your reasoning criteria.
- When you realized that you didn't have enough data to properly train your models for each specific task, you got **pre-trained components (e.g. VGG-16 or ResNet)** that were trained on a giant repository of training data and already contained the necessary logic you wanted.

> <div><img src="imgs/machine-learning-process.jpg" width="800"/></div>
>
> **Source: [High-Performance Data Science with RAPIDS | NVIDIA](https://www.nvidia.com/en-us/deep-learning-ai/software/rapids/)**

If you've already gone through all of this, congratulations! You have roughly all the skills you need to advance far and wide beyond the topics you've studied so far, and that includes the awesome space of language modeling!

Similar to vision, language is a topic that is extremely complicated and high-dimensional if treated naively. Recall that a common 200x200 colored image contains $200\times 200\times 3 = 120{,}000$ features! Now imagine how many combinations of words can be found in a sentence: **A LOT!** Lucky for us, there are plenty of creative techniques that can make this problem a lot more tractable, and the large pre-trained model ecosystem has a variety of tools to make them easy to implement!
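To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The vocabulary size and sentence length below are illustrative assumptions, not figures taken from this notebook.

```python
# Feature count of a 200x200 RGB image (the figure quoted in the text).
image_features = 200 * 200 * 3

# Assumed values, for illustration only: a 30,000-token vocabulary
# and sentences that are exactly 10 tokens long.
vocab_size = 30_000
sentence_length = 10

# Every position can hold any vocabulary entry, so the number of
# distinct token sequences grows exponentially with sentence length.
possible_sequences = vocab_size ** sentence_length

print(f"Image features:     {image_features:,}")
print(f"Possible sequences: {possible_sequences:.2e}")
```

Most of those sequences are gibberish, of course, which is part of why learned structure helps so much.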

**That's what this course is all about: How language problems can be approached, what tools are available, and what kinds of problems are out there!** 

-------

## 1.2. Pulling In Our First LLM

Instead of constructing things from the ground up, this course will focus on spot-lighting tools that you can use and diving into them as necessary to figure out exactly how they work. And the best tool to start our journey into language modeling is **HuggingFace &#x1F917;!**

[**HuggingFace**](https://huggingface.co/) is an open-source community that offers simple strategies for accessing, uploading, and using large deep learning models for testing and deployment. The topics they support span many tasks and modalities, but we'll be focusing on large language models (**LLMs**) for most of this course.

When searching through the [HuggingFace Models catalog](https://huggingface.co/models?sort=downloads&search=bert), you'll quickly stumble upon the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model. Taking a look at its card, you'll see several interesting things:

1. Loading in the model requires the use of the [`transformers`](https://github.com/huggingface/transformers) package. This is the HuggingFace package used to support most of the platform's language modeling code. Its name, `transformers`, refers to the primary architectural structure underlying many of these models, and we'll be talking about this structure in some detail throughout the next notebook. You'll be using `transformers` quite a bit from here on out, so get comfortable with it, and feel free to search around and dive into the source code if you feel like it!
2. The card describes a default version that can be pulled in for mask filling (to be discussed) via its [Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) support. By **pipeline**, we mean the end-to-end process of going from a human-reasonable input to a human-reasonable output. This makes it super-easy to pull in the model and helps you to forget that there is a tensor-in/tensor-out differentiable process going on somewhere under the hood.

As a representative example, we can go ahead and pull in the discussed [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model and test it out!

In [0]:
from transformers import pipeline

## Load the pipeline and predict the mask fill (example from model card)
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

Unexpected internal error when disabling torch.jit: No module named 'torch'


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

2025-04-02 12:09:29.241329: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2025-04-02 12:09:29.357160: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2025-04-02 12:09:29.372131: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2025-04-02 12:09:30.351999: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Device set to use 0
2025-04-02 12:09:32.268804: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.


[{'score': 0.10731092095375061,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i ' m a fashion model."},
 {'score': 0.08774460107088089,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i ' m a role model."},
 {'score': 0.053383976221084595,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i ' m a new model."},
 {'score': 0.046672143042087555,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello i ' m a super model."},
 {'score': 0.027095846831798553,
  'token': 2986,
  'token_str': 'fine',
  'sequence': "hello i ' m a fine model."}]

**Amazing! It just works!** Under the hood, there's a deep learning model somewhere - crunching numbers and spitting out probabilities to make all of this happen - but it's easy to forget that sometimes. It's especially easy to forget when the model you're dealing with is actually generating human-sounding text, at which point you may start to wonder if it's connected to a human brain somewhere in a warehouse in California. But that's what this course is for: **to see what's actually going on behind the scenes and know how to use it to make good systems**.
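Since the pipeline returns an ordinary list of dictionaries, you can inspect it with plain Python. The sketch below copies the scores and tokens from the output above (dropping the `sequence` field for brevity) and shows one way to pull out the top prediction and see how much probability the five candidates cover together.

```python
# Predictions copied from the fill-mask output shown above.
predictions = [
    {"score": 0.10731092095375061, "token": 4827, "token_str": "fashion"},
    {"score": 0.08774460107088089, "token": 2535, "token_str": "role"},
    {"score": 0.053383976221084595, "token": 2047, "token_str": "new"},
    {"score": 0.046672143042087555, "token": 3565, "token_str": "super"},
    {"score": 0.027095846831798553, "token": 2986, "token_str": "fine"},
]

# The highest-scoring candidate for the [MASK] slot.
best = max(predictions, key=lambda p: p["score"])

# Total probability covered by the five returned candidates.
top5_mass = sum(p["score"] for p in predictions)

print(f"Best guess: {best['token_str']} ({best['score']:.1%})")
print(f"Probability covered by the top 5: {top5_mass:.1%}")
```

The five candidates account for only about a third of the probability, so the model is spreading a lot of its belief over other words in its vocabulary.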

-------

## 1.3. Dissecting The Pipeline

Looking at this resolution - where we just see the pipeline taking strings in and spitting a dictionary out - isn't really helping our understanding much, so let's see what's actually going on with the pipeline. We can peel back the layer of abstraction just a little to see the structure inside of the pipeline:

In [0]:
from transformers import FillMaskPipeline
from transformers import AutoTokenizer, AutoModel        ## General-purpose fully-automatic
from transformers import TFAutoModelForMaskedLM          ## Default import for FillMaskPipeline
from transformers import BertTokenizer, BertForMaskedLM  ## Realized components after automatic resolution

class MyMlmPipeline(FillMaskPipeline):
    def __init__(self):
        ## The fully-automatic version
        super().__init__(
            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased'),
            model = TFAutoModelForMaskedLM.from_pretrained("bert-base-uncased")
        )

    def __call__(self, string, verbose=False):
        ## Verbose argument just there for our convenience
        input_tensors = self.preprocess(string)
        if verbose: print('\npreprocess outputs:\n', input_tensors, '\n')
        output_tensors = self.forward(input_tensors)
        if verbose: print('forward outputs:\n', output_tensors, '\n')
        output = self.postprocess(output_tensors)
        return output

    # def preprocess(self, string):
    #     string = [string] if isinstance(string, str) else string
    #     inputs = self.tokenizer(string, return_tensors="tf")  ## "tf" to match the TensorFlow model
    #     return inputs

    # def forward(self, tensor_dict):
    #     output_tensors = self.model(**tensor_dict)  ## TensorFlow models are called directly
    #     return {**output_tensors, **tensor_dict}

    # def postprocess(self, tensor_dict):
    #     ## Very Task-specific; see FillMaskPipeline.postprocess
    #     return super().postprocess(tensor_dict)

unmasker = MyMlmPipeline()
unmasker("Hello, Mr. Bert! How is it [MASK]?", verbose=True)

com.databricks.backend.common.rpc.DriverStoppedException: Driver down cause: driver state change (exit code: 137)
	at com.databricks.spark.chauffeur.ChauffeurState.processDriverStateChange(ChauffeurState.scala:276)
	at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1565)
	at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1(Chauffeur.scala:203)
	at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1$adapted(Chauffeur.scala:203)
	at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4(DriverDaemonMonitorImpl.scala:295)
	at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4$adapted(DriverDaemonMonitorImpl.scala:295)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.goToStopped(DriverDaemonMonitorImpl.scala:295)
	at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.monitorDriver(DriverDaemonMonitorImpl.s

We can also see that the pipeline is largely composed of two main components:
- `tokenizer`: The strategy to convert the input strings to something usable by the model.
- `model`: The deep learning model responsible for the input-tensor-to-output-tensor conversion.

With these, the pipeline is able to support its streamlined interface with a pretty intuitive organization scheme:  
- `preprocess`: human-intuitive input $\to$ tensor inputs. Facilitated by `tokenizer`
- `forward`: tensor inputs $\to$ tensor outputs. Facilitated by `model`
- `postprocess`: tensor outputs $\to$ human-intuitive outputs. Facilitated by the pipeline task.
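The three stages above can be sketched without any deep learning library at all. The toy below is only a stand-in: the vocabulary, the hand-picked scores, and the function bodies are hypothetical and do not reflect how `transformers` implements them. It is meant only to show the shape of the data at each step.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["[MASK]", "hello", "how", "is", "it", "going", "today", "?"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def preprocess(text):
    """Human-readable string -> list of integer token ids."""
    return [TOKEN_TO_ID[tok] for tok in text.split()]

def forward(token_ids):
    """Token ids -> one raw score (logit) per vocabulary entry for the masked slot."""
    if TOKEN_TO_ID["[MASK]"] not in token_ids:
        raise ValueError("input must contain a [MASK] token")
    # Stand-in for a neural network: hand-picked scores, not learned weights.
    logits = [0.0] * len(VOCAB)
    logits[TOKEN_TO_ID["going"]] = 2.0
    logits[TOKEN_TO_ID["today"]] = 1.0
    return {"input_ids": token_ids, "logits": logits}

def postprocess(outputs, top_k=2):
    """Raw scores -> human-readable ranked predictions."""
    exps = [math.exp(x) for x in outputs["logits"]]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax turns scores into probabilities
    ranked = sorted(range(len(VOCAB)), key=lambda i: probs[i], reverse=True)
    return [
        {"score": probs[i], "token": i, "token_str": VOCAB[i]}
        for i in ranked[:top_k]
    ]

token_ids = preprocess("how is it [MASK] ?")
result = postprocess(forward(token_ids))
print(token_ids)
print(result)
```

The real pipeline follows the same flow, with the tokenizer and the model doing the heavy lifting in the first two steps.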

For deep learning, this actually seems pretty reasonable; the model reasons in numbers, and you probably don't want to expose that to the typical user when your domain is language. This makes it very easy for a typical starting user to just pick up the models and roll with them, so hopefully you feel a bit more comfortable when approaching the open-sourced LLM ecosystem!

-------

## 1.4. Your Course Environment

So yes, pulling in a model is just that easy! Throughout the course, feel free to pull in models that you think are interesting and see how they function! Do try to keep them small unless we specifically ask you for a giant model; **your compute environment is relatively powerful, but not infinite!** We've already pre-loaded a selection of models for you, so please check them out (and consider their licenses) in the [`extras_and_licenses/99_licenses.ipynb`](extras_and_licenses/99_licenses.ipynb).

For this course, you'll be using a compute budget that is relatively powerful compared to consumer-level hardware configurations, and the following resources will be especially important for language modeling: 
- **System Memory**: The largest language models are ***large***, and working with them can easily overload a consumer-level memory budget. Certain workloads can require tens or even hundreds of GB for a single model, so this environment is equipped with enough to hold what we need.
- **GPU**: GPU power is extremely important for performing fast deep learning training and inference, since deep learning involves a lot of number crunching to mold your inputs into predictions. Much of this is [*embarrassingly parallelizable*](https://en.wikipedia.org/wiki/Embarrassingly_parallel), so the thousands of cores associated with many modern GPUs (especially the [CUDA cores](https://en.wikipedia.org/wiki/CUDA) of NVIDIA GPUs) are incredibly useful for speeding up the forward and backward passes. 
    - **GPU RAM**: Large language models need to be loaded onto the GPU for rapid use, so GPU RAM matters for keeping the necessary information in memory. Many applications try to make good use of both the CPU and GPU, but sometimes low GPU RAM can impose a serious constraint on your ability to use accelerated LLMs. 
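For a feel of the memory numbers involved, a rough rule of thumb is that the weights alone take about (number of parameters) × (bytes per parameter). The sketch below applies it to the 440 MB `model.safetensors` download seen earlier, assuming 32-bit floats, and to a hypothetical 7-billion-parameter model at 16-bit precision. It ignores activations and framework overhead, so treat the results as lower bounds.

```python
def weights_size_gb(num_parameters, bytes_per_parameter):
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# The bert-base-uncased download above was about 440 MB. Assuming 4 bytes
# (float32) per weight, that implies roughly 110 million parameters.
bert_parameters = 440e6 / 4
print(f"bert-base-uncased: ~{bert_parameters / 1e6:.0f}M parameters, "
      f"~{weights_size_gb(bert_parameters, 4):.2f} GB at float32")

# Hypothetical larger model, for illustration only.
large_parameters = 7_000_000_000
print(f"7B-parameter model: ~{weights_size_gb(large_parameters, 2):.0f} GB at float16")
```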

Specifically, you're allocated the following resources: 

In [0]:
%%bash
echo """
===================================================
GPU SPECIFICATION
===================================================
"""
nvidia-smi
echo """
===================================================
MEMORY SPECIFICATION
===================================================
"""
cat /proc/meminfo

**So yeah, decent compute budget, *but not infinite*!**

Before starting the next notebook, please restart the Jupyter kernel by running the code cell below. This will prevent memory issues in future notebooks and will keep the instance's memory load within our compute budget.
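A minimal sketch of such a restart cell follows, assuming a standard Jupyter setup backed by `ipykernel`. Managed notebook platforms may provide their own restart mechanism, in which case that should be preferred. The helper name `restart_kernel` is our own.

```python
def restart_kernel():
    """Ask the running IPython kernel to shut down and restart."""
    # Imported lazily so that defining this helper has no side effects.
    import IPython

    app = IPython.Application.instance()
    # Assumes the application is an ipykernel app that exposes `.kernel`;
    # passing True requests a restart after shutdown.
    app.kernel.do_shutdown(True)

# Uncomment the next line when you are ready to restart:
# restart_kernel()
```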

-------

## 1.5. Wrapping Up

Now that you've seen how easy it is to pull in a model, we get to the hard parts: 

**Can I actually use these models?** That really depends on licensing:

- ***The earlier models we'll look at will have licenses associated with the data, and many of the fine-tuned ones are trained for proof of concept only and are therefore not commercially viable.*** After this course, you'll be able to experiment with them and see whether you can find one that's good enough and also viable. Alternatively, you'll be able to take inspiration from the models you find and fine-tune your own model on your own dataset!

- ***The later models we'll look at are viable for use commercially and are extremely powerful and general!*** They are amazing on their own and can work great with the smaller models to satisfy compute budgets and control structures!

- **NOTE:** We greatly encourage you to briefly check out [`extras_and_licenses/99_licenses.ipynb`](extras_and_licenses/99_licenses.ipynb) for a good look at the licenses and considerations.

**How and why do they work?** We'll discuss this at length.

**What models do you choose?** We'll discuss this at length as well.

**In the next few notebooks, we'll be getting familiar with how these systems work!**

<font color="#76b900">**Get excited, and welcome to the course!!**</font>