# Walkthrough

In this era of large-scale deep learning, the most interesting AI models are massive black boxes that are hard to run. Ordinary commercial inference service APIs let you interact with huge models, but they do not let you access model internals.

The nnsight library is different: it gives you full access to all the neural network internals. When used together with a remote service like the [National Deep Inference Facility](https://thevisible.net/docs/NDIF-proposal.pdf) (NDIF), it lets you run complex experiments on huge open source models easily, with fully transparent access.

Our team wants to enable entire labs and independent researchers alike, as we believe a large, passionate, and collaborative community will produce the next big insights in a profoundly important field.

# But first, let's start small





## The Tracing Context

To demonstrate the core functionality and syntax of nnsight, we'll define and use a tiny two-layer neural network.

Our little model here is composed of two sub-modules: two linear layers ('layer1', 'layer2'). We specify the sizes of each of these modules and create some complementary example input.

In [1]:
from collections import OrderedDict
import torch

input_size = 5
hidden_dims = 10
output_size = 2

net = torch.nn.Sequential(OrderedDict([
    ('layer1', torch.nn.Linear(input_size, hidden_dims)),
    ('layer2', torch.nn.Linear(hidden_dims, output_size)),
])).requires_grad_(False)

input = torch.rand((1, input_size))

The core object of the nnsight package is `NNsight`. It wraps a given PyTorch model to enable the capabilities nnsight provides.

In [2]:
from nnsight import NNsight

model = NNsight(net)



PyTorch models, when printed, show a named hierarchy of modules, which is very useful when accessing sub-components directly. NNsight models work just the same.

In [3]:
print(model)

Sequential(
  (layer1): Linear(in_features=5, out_features=10, bias=True)
  (layer2): Linear(in_features=10, out_features=2, bias=True)
)


Before we actually get to using the model we just created, let's talk about what a `context` is in Python.

Often enough when coding, you want to create some object, or initiate some logic, that you later want to destroy or conclude.

The most common application is opening files like the following example:

```python
with open('myfile.txt', 'r') as file:
  text = file.read()
```

Python uses the `with` keyword to enter a context-like object. This object defines logic to be run at the start of the `with` block, and logic to be run when exiting. In this case, entering the context opens a file and exiting the context closes it. Being within the context means we can read from the file. Simple enough! Now we can discuss how `nnsight` uses contexts to enable powerful and intuitive access into the internals of model computation.
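
To make the enter/exit mechanics concrete, here is a minimal sketch of a hand-written context manager that defers all work until the block exits. The class name `DeferredRunner` and its `request` method are hypothetical and only illustrate the pattern; they are not part of `nnsight`.

```python
class DeferredRunner:
    """Toy context manager: collects work inside the block, runs it on exit."""

    def __init__(self):
        self.pending = []   # work requested inside the `with` block
        self.results = []   # filled in only when the block exits

    def __enter__(self):
        # Logic run at the start of the `with` block.
        self.pending.clear()
        self.results.clear()
        return self

    def request(self, fn, *args):
        # Nothing is executed here; we only remember what to do later.
        self.pending.append((fn, args))

    def __exit__(self, exc_type, exc_value, traceback):
        # Logic run when leaving the block: execute everything requested.
        if exc_type is None:
            self.results = [fn(*args) for fn, args in self.pending]
        return False  # do not swallow exceptions


with DeferredRunner() as runner:
    runner.request(sum, [1, 2, 3])
    runner.request(max, [4, 9, 2])
    results_inside = list(runner.results)  # still empty: nothing has run yet

results_after = runner.results  # populated once the block has exited
```

Inside the block, requests are only recorded; the results exist only after the `with` block ends.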


Introducing the tracing context. Just like before, something happens upon entering the tracing context, something happens when exiting, and being inside enables some functionality.

We enter the tracing context by calling `.trace(<input>)` on the `NNsight` model we created before. Entering it denotes we want to run the model given our input... but not yet! The model is only run upon exiting the tracing context.

In [4]:
with model.trace(input) as tracer:
  pass

But where's the output? To get that, we'll have to learn how to request it from within the tracing context.

## Getting

Earlier, when we wrapped our little net with the `NNsight` class, this added a couple of properties onto each module in the model (including the root model itself). The ones we care about are `.input` and `.output`.

```python
model.input
model.output
```

The names are pretty self-explanatory. They correspond to the inputs and outputs of their respective modules during a forward pass of an input through the model. These are what we're going to interact with in the tracing context.

However, remember how the model isn't executed until the end of the tracing context? So how can we access their inputs and outputs during computation from within the context? Well, we can't.

`.input` and `.output` are Proxies for the eventual inputs and outputs of a module. In other words, when you access `model.output`, what you're communicating to `nnsight` is "When you compute the output of `model`, please grab it for me and put the value into its corresponding Proxy object's `.value` attribute." Let's try just that:

In [5]:
with model.trace(input) as tracer:

  output = model.output

print(output.value)

ValueError: Accessing Proxy value before it's been set.

Oh no, an error! "Accessing Proxy value before it's been set."

If `.value` isn't filled in after leaving the tracing context, accessing the value will give you this error. In reality, however, the value was filled in; it was just immediately removed. Why?

Proxy objects track their listeners (that is, the other Proxy objects that rely on them), and when all of a Proxy's listeners are complete, it deletes the `.value` associated with the Proxy in order to save memory. To prevent this, we call `.save()` on the Proxy objects we want to access outside of the tracing context:

In [6]:
with model.trace(input) as tracer:

  output = model.output.save()

print(output.value)

tensor([[-0.1988, -0.0418]])


Success! We now have the model output, meaning you just completed your first intervention request using Proxies.

These requests are handled at the soonest possible moment they can be completed. In this case, right after the model's output was computed. Collectively these requests form the `intervention graph` and we call the process of executing it alongside the model's normal computation graph, `interleaving`.
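
The interplay between listeners, memory clean-up, and `.save()` can be sketched in plain Python. This is a toy model only: `ToyProxy`, `remaining_listeners`, `set_value`, and `listener_done` are hypothetical names chosen for illustration, not the actual `nnsight` implementation.

```python
class ToyProxy:
    """Hypothetical stand-in for a Proxy; not nnsight's real class."""

    def __init__(self):
        self.remaining_listeners = 0  # dependents that still need the value
        self.saved = False
        self._value = None
        self._is_set = False

    def save(self):
        # Mark the value as one that must survive after execution.
        self.saved = True
        return self

    @property
    def value(self):
        if not self._is_set:
            raise ValueError("Accessing Proxy value before it's been set.")
        return self._value

    def set_value(self, value):
        self._value = value
        self._is_set = True
        self._maybe_free()

    def listener_done(self):
        self.remaining_listeners -= 1
        self._maybe_free()

    def _maybe_free(self):
        # Drop the value once nobody needs it, unless it was saved.
        if self.remaining_listeners <= 0 and not self.saved:
            self._value = None
            self._is_set = False


unsaved = ToyProxy()
unsaved.set_value([0.1, 0.2])        # no listeners, not saved: freed at once

saved = ToyProxy().save()
saved.set_value([0.1, 0.2])          # saved: the value survives

shared = ToyProxy()
shared.remaining_listeners = 1
shared.set_value(3.0)
value_while_needed = shared.value    # a listener still needs it
shared.listener_done()               # last listener finished: value is freed

try:
    unsaved.value
    unsaved_error = None
except ValueError as err:
    unsaved_error = str(err)
```

The unsaved proxy reproduces the error from earlier, while the saved one keeps its value.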

What else can we request? There's nothing special about the model itself versus its submodules. Just like we saved the output of the model as a whole, we can save the output of any of its submodules. To get to them we use normal Python attribute syntax, and we know where the modules are because we printed out the model earlier:

In [7]:
print(model)

Sequential(
  (layer1): Linear(in_features=5, out_features=10, bias=True)
  (layer2): Linear(in_features=10, out_features=2, bias=True)
)


In [8]:
with model.trace(input) as tracer:

  l1_output = model.layer1.output.save()

print(l1_output.value)

tensor([[ 0.4802, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.3921]])


Let's do the same for the input of layer2. While we're at it, let's also drop the `as tracer`, as we won't be needing the tracer object itself for a few sections:

In [9]:
with model.trace(input):

  l2_input = model.layer2.input.save()

print(l2_input.value)

((tensor([[ 0.4802, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.3921]]),), {})


<details>
  <summary>On module inputs</summary>

  ---

  Notice how the value for l2_input was not just a single tensor.
  The value from .input has the form:

      tuple(tuple(args), dictionary(kwargs))

  The first element of the tuple is itself a tuple of all positional arguments, and the second element is a dictionary of the keyword arguments.

  ---

</details>
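
Given the `(args, kwargs)` layout described in the note on module inputs, a saved `.input` value can be unpacked with ordinary tuple unpacking. The sketch below uses a plain list in place of the tensor so it runs without torch; the variable names are illustrative.

```python
# Stand-in for a saved `.input` value: (positional args, keyword args).
# A short list replaces the real tensor purely for illustration.
l2_input_value = (([0.4802, -0.0706, -0.4357],), {})

# Split into the positional-argument tuple and the keyword-argument dict.
args, kwargs = l2_input_value

# The hidden states are the first positional argument passed to the module.
hidden_states = args[0]
```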


Now that we can access activations, we also want to do some post-processing on them. Let's find out which dimension of layer1's output has the highest value.

## Functions, Methods, and Operations

We could do this by calling `torch.argmax(...)` after the tracing context... or we can just leverage the fact that `nnsight` handles functions and methods within the tracing context, by creating a Proxy request for it:

In [10]:
with model.trace(input):

  # Note we don't need to call .save() on the output,
  # as we're only using its value within the tracing context.
  l1_output = model.layer1.output

  l1_amax = torch.argmax(l1_output, dim=1).save()

print(l1_amax[0])

tensor(0)


Nice! That worked seamlessly, but hold on, how come we didn't need to call `.value[0]` on the result? In previous sections, we were just being explicit to get an understanding of Proxies and their value. In practice, however, `nnsight` knows that outside of the tracing context we only care about the actual value, so printing, indexing, and applying functions all immediately return and reflect the data in `.value`. So for the rest of the tutorial we won't use it.

The same principles work for methods and operations as well:

In [11]:
with model.trace(input):

  value = (model.layer1.output.sum() + model.layer2.output.sum()).save()

print(value)

tensor(0.1685)


By default, torch functions and methods, as well as all operators, work with `nnsight`. We also enable the use of the `einops` library.

So to recap, this cell is saying to `nnsight`, "Run the model with the given `input`. When the output of layer1 is computed, take its sum. Then do the same for layer2. Now that both of those are computed, add them and make sure not to delete this value, as I wish to use it outside of the tracing context."
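
One way to picture how operations on not-yet-computed values can be recorded and run later is with operator overloading. The following is a simplified sketch under that assumption; the `Deferred` class and its methods are hypothetical and do not reflect how `nnsight` builds its graph internally.

```python
class Deferred:
    """Hypothetical sketch: records operations now, evaluates them later."""

    def __init__(self, compute, description):
        self._compute = compute          # function of the eventual activations
        self.description = description   # human-readable form of the request

    @classmethod
    def source(cls, name):
        # A value that will be looked up once activations are available.
        return cls(lambda env: env[name], name)

    def sum(self):
        return Deferred(
            lambda env: sum(self._compute(env)),
            f"sum({self.description})",
        )

    def __add__(self, other):
        return Deferred(
            lambda env: self._compute(env) + other._compute(env),
            f"({self.description} + {other.description})",
        )

    def evaluate(self, env):
        return self._compute(env)


# Build the request before any activations exist.
request = Deferred.source("layer1.output").sum() + Deferred.source("layer2.output").sum()

# Later, once the "model" has produced activations, evaluate the request.
activations = {"layer1.output": [1.0, 2.0, 3.0], "layer2.output": [0.5, 0.25]}
result = request.evaluate(activations)
```

Here `request.description` shows the recorded expression, and `result` is only computed when activations are supplied.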

Getting and analyzing the activations from various points in a model can be really insightful, and a number of ML techniques do exactly that. However, oftentimes we not only want to view the computation of a model, but influence it as well.

## Setting

To demonstrate editing the flow of information through the model, let's set the first dimension of the first layer's output to 0. `NNsight` makes this really easy using the familiar assignment '=' operator:

In [12]:
with model.trace(input):

  # Save the output before the edit to compare.
  # Notice we apply .clone() before saving as the setting operation is in-place.
  l1_output_before = model.layer1.output.clone().save()

  # Access the 0th index of the hidden state dimension and set it to 0.
  model.layer1.output[:, 0] = 0

  # Save the output after to see our edit.
  l1_output_after = model.layer1.output.save()

print("Before:", l1_output_before)
print("After:", l1_output_after)

Before: tensor([[ 0.4802, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.3921]])
After: tensor([[ 0.0000, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.3921]])


Seems our change was reflected. Now the same for the last dimension:

In [13]:
with model.trace(input):

  # Save the output before the edit to compare.
  # Notice we apply .clone() before saving as the setting operation is in-place.
  l1_output_before = model.layer1.output.clone().save()

  # Access the last index of the hidden state dimension and set it to 0.
  model.layer1.output[:, hidden_dims] = 0

  # Save the output after to see our edit.
  l1_output_after = model.layer1.output.save()

print("Before:", l1_output_before)
print("After:", l1_output_after)

IndexError: index 10 is out of bounds for dimension 1 with size 10

Ah of course, we needed to index at `hidden_dims - 1` not `hidden_dims`. How did `nnsight` know there was this indexing error before leaving the tracing context?

Earlier when discussing contexts in Python, we learned some logic happens upon entering, and some logic happens upon exiting. We know the model is actually run on exit, but what happens on enter? Our input IS actually run through the model, however under its own "fake" context. This means the input makes its way through all of the model's operations, allowing `nnsight` to record the shapes and data types of module inputs and outputs! The operations are never executed using tensors with real values, so it doesn't incur any memory costs. Then, when creating proxy requests like the setting one above, `nnsight` also attempts to execute the request on the "fake" values we recorded, letting us know if our request is feasible before even running the model.

<details>
<summary>On scanning</summary>

---

"Scanning" is what we call running "fake" inputs through the model to collect information like shapes and types. "Validating" is what we call trying to execute your intervention proxies with "fake" inputs to see if they work. If you're doing anything in a loop where efficiency is important, you should turn off scanning and validating. You can turn off validating in `.trace(...)` like `.trace(..., validate=False)`. You can turn off scanning in `Tracer.invoke(...)` ([see the Batching section](#batching-id)) like `Tracer.invoke(..., scan=False)`.

---

</details>
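
The idea of checking a request against recorded shapes, without touching any real data, can be sketched in a few lines of plain Python. The helpers `linear_shape` and `validate_column_index` are hypothetical and only mimic the spirit of the pre-check; `nnsight` itself works on "fake" tensors rather than shape tuples like these.

```python
def linear_shape(input_shape, in_features, out_features):
    """Propagate a shape through a linear layer without touching any data."""
    if input_shape[-1] != in_features:
        raise ValueError(
            f"expected last dimension {in_features}, got {input_shape[-1]}"
        )
    return input_shape[:-1] + (out_features,)


def validate_column_index(shape, index):
    """Check an index into dimension 1 using only the recorded shape."""
    size = shape[1]
    if not -size <= index < size:
        raise IndexError(
            f"index {index} is out of bounds for dimension 1 with size {size}"
        )


# "Scan": record the shapes an input of shape (1, 5) would produce.
input_shape = (1, 5)
layer1_shape = linear_shape(input_shape, 5, 10)
layer2_shape = linear_shape(layer1_shape, 10, 2)

# "Validate": the correct index passes, the off-by-one index fails early.
validate_column_index(layer1_shape, 9)
try:
    validate_column_index(layer1_shape, 10)
    message = None
except IndexError as err:
    message = str(err)
```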

Let's try again with the correct indexing, and view the shape of the output before leaving the tracing context:

In [14]:
with model.trace(input):

  # Save the output before the edit to compare.
  # Notice we apply .clone() before saving as the setting operation is in-place.
  l1_output_before = model.layer1.output.clone().save()

  print(f"layer1 output shape: {model.layer1.output.shape}")

  # Access the last index of the hidden state dimension and set it to 0.
  model.layer1.output[:, hidden_dims - 1] = 0

  # Save the output after to see our edit.
  l1_output_after = model.layer1.output.save()

print("Before:", l1_output_before)
print("After:", l1_output_after)

layer1 output shape: torch.Size([1, 10])
Before: tensor([[ 0.4802, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.3921]])
After: tensor([[ 0.4802, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.0000]])


We can also just replace proxy inputs and outputs with tensors of the same shape and type. Let's use the shape information we have at our disposal to add noise to the output, and replace it with this new noised tensor:

In [15]:
with model.trace(input):

  # Save the output before the edit to compare.
  # Notice we apply .clone() before saving as the setting operation is in-place.
  l1_output_before = model.layer1.output.clone().save()

  # Create random noise with variance of .001
  noise = (0.001**0.5)*torch.randn(l1_output_before.shape)

  # Add to original value and replace.
  model.layer1.output = l1_output_before + noise

  # Save the output after to see our edit.
  l1_output_after = model.layer1.output.save()

print("Before:", l1_output_before)
print("After:", l1_output_after)

Before: tensor([[ 0.4802, -0.0706, -0.4357,  0.0113,  0.0720, -0.1803,  0.4475, -0.1772,
         -0.1301,  0.3921]])
After: tensor([[ 0.5208, -0.0984, -0.4089,  0.0626,  0.0895, -0.2086,  0.4192, -0.1822,
         -0.1020,  0.4061]])
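
The factor `0.001**0.5` in the cell above is the standard deviation that corresponds to a variance of 0.001: scaling standard normal samples by the square root of a target variance yields samples with that variance. A quick sketch using only the standard library (so it runs without torch; the sample count and seed are arbitrary choices):

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

target_variance = 0.001
scale = target_variance ** 0.5  # standard deviation for the target variance

# Scale standard normal samples, mirroring (0.001**0.5) * torch.randn(...).
samples = [scale * random.gauss(0.0, 1.0) for _ in range(50_000)]

# The empirical variance should land close to the target.
empirical_variance = statistics.pvariance(samples)
```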


## Gradients

`NNsight` can also let you apply backprop and access gradients with respect to a loss. Like `.input` and `.output` on modules, `nnsight` also exposes `.grad` on Proxies themselves (assuming they are proxies of tensors):

In [16]:
with model.trace(input):

  # We need to explicitly have the tensor require grad
  # as the model we defined earlier turned off requiring grad.
  model.layer1.output.requires_grad = True

  # We call .grad on a tensor Proxy to communicate we want to store its gradient.
  # We need to call .save() of course, as .grad is its own Proxy.
  layer1_output_grad = model.layer1.output.grad.save()
  layer2_output_grad = model.layer2.output.grad.save()

  # Need a loss to propagate through the later modules in order to have a grad.
  loss = model.output.sum()
  loss.backward()

print("Layer 1 output gradient:", layer1_output_grad)
print("Layer 2 output gradient:", layer2_output_grad)


Layer 1 output gradient: tensor([[-0.1807, -0.0932, -0.0272, -0.1585,  0.0117, -0.1133, -0.1660, -0.3290,
          0.3369, -0.0236]])
Layer 2 output gradient: tensor([[1., 1.]])
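
As a sanity check on these numbers, here is a short derivation, assuming the standard `torch.nn.Linear` convention $y = h W_2^\top + b_2$, where $h$ is the output of layer1 and $y$ is the output of layer2 (the symbols $h$, $y$, $W_2$, and $b_2$ are introduced here for illustration only). With the loss defined as the sum of the model output:

$$
\mathcal{L} = \sum_{j=1}^{2} y_j
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial y_j} = 1,
\qquad
\frac{\partial \mathcal{L}}{\partial h_i}
= \sum_{j=1}^{2} \frac{\partial \mathcal{L}}{\partial y_j}\,\frac{\partial y_j}{\partial h_i}
= \sum_{j=1}^{2} (W_2)_{ji}.
$$

So the layer2 output gradient is all ones, and each entry of the layer1 output gradient is the corresponding column sum of layer2's weight matrix.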


All of the features we learned previously apply to `.grad`, meaning we can apply operations and edit the gradients. Let's zero the grad of layer1 and double the grad of layer2.

In [17]:
with model.trace(input):

  # We need to explicitly have the tensor require grad
  # as the model we defined earlier turned off requiring grad.
  model.layer1.output.requires_grad = True

  model.layer1.output.grad[:] = 0
  model.layer2.output.grad = model.layer2.output.grad.clone() * 2

  layer1_output_grad = model.layer1.output.grad.save()
  layer2_output_grad = model.layer2.output.grad.save()

  # Need a loss to propagate through the later modules in order to have a grad.
  loss = model.output.sum()
  loss.backward()

print("Layer 1 output gradient:", layer1_output_grad)
print("Layer 2 output gradient:", layer2_output_grad)

Layer 1 output gradient: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Layer 2 output gradient: tensor([[2., 2.]])


# Bigger

Now that we have the basics of `nnsight` under our belt, we can scale our model up and combine the techniques we've learned into more interesting experiments.

The `NNsight` class is very bare-bones. It wraps a pre-defined model and does no pre-processing on the inputs we enter. It's designed to be extended with more complex and powerful types of models, and we're excited to see what can be done to leverage the baseline features of `nnsight`.

## LanguageModel

`LanguageModel` is a subclass of `NNsight`. While we could define and create a model to pass in directly, `LanguageModel` includes special support for Hugging Face language models, including automatically loading models from a Hugging Face ID, and loading the model together with the appropriate tokenizer.

Here is how you can use `LanguageModel` to load `GPT-2`:

In [18]:
from nnsight import LanguageModel

model = LanguageModel('openai-community/gpt2', device_map="auto")

print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
  (generator): WrapperModule()
)


<details>
<summary>On Model Initialization</summary>

---

A few important things to note:

  Keyword arguments passed to the initialization of `LanguageModel` make their way to the Hugging Face specific loading logic. In this case, `device_map` specifies which devices to use, and its value `auto` indicates to evenly distribute the model across all available GPUs (and the CPU if no GPUs are available). Other arguments can be found here: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM


  When we initialize `LanguageModel`, we aren't yet loading the parameters of the model into memory. We are actually loading a 'meta' version of the model which doesn't take up any memory, but still allows us to view and trace actions on it. Only when the first tracing context exits is the real model with full parameters (using the keyword arguments we defined on init) loaded into memory. To load into memory on initialization, you can pass `dispatch=True` into `LanguageModel` like `LanguageModel('openai-community/gpt2', device_map="auto", dispatch=True)`.

---

</details>


Let's put together some of the features we applied to the small model, but now on `GPT-2`. Unlike `NNsight`, `LanguageModel` does define logic to pre-process inputs upon entering the tracing context which makes interacting with the model simpler without having to directly access the tokenizer.

In this example, we ablate the value coming from the last layer's MLP module and decode the logits to see what token the model predicts without influence from the module:

In [19]:
with model.trace('The Eiffel Tower is in the city of'):

  # Access the last layer using h[-1] as it's a ModuleList
  # Access the first index of .output as that's where the hidden states are.
  model.transformer.h[-1].mlp.output[0][:] = 0

  # Logits come out of model.lm_head and we apply argmax to get the predicted token ids.
  token_ids = model.lm_head.output.argmax(dim=-1).save()

print("Token IDs:", token_ids)

# Apply the tokenizer to decode the ids into words after the tracing context.
print("Prediction:", model.tokenizer.decode(token_ids[0][-1]))




Token IDs: tensor([[ 262,   12,  417, 8765,   11,  257,  262, 3504,  338, 3576]],
       device='mps:0')
Prediction:  London


You just ran a little intervention on a much more complex model with orders of magnitude more parameters! An important piece of information we're missing, though, is what the prediction would look like without our ablation.

Of course, we could just run two tracing contexts and compare the outputs. However, this would require two forward passes through the model. `NNsight` can do better than that.

<a name="batching-id"></a>
## Batching

It's time to bring back the `Tracer` object we dropped before. See, when you call `.trace(...)` with some input, it's actually creating two different contexts behind the scenes. The second one is the invoker context. Being within this context just means that `.input` and `.output` should refer only to the input you've given to that invoke, and calling `.trace(...)` with some input just means there's only one input and therefore only one invoker context.

We can call `.trace()` without input and call `Tracer.invoke(...)` to manually create the invoker context with our input. Now every subsequent time we call `.invoke(...)`, new interventions will only refer to the input in that invoke. Then when exiting the tracing context, the inputs from all of the invokers will be batched together, and they, along with your Proxies, will be executed in one forward pass! So let's do the ablation experiment, and compute a 'control' output to compare to:

<details>
<summary>On the invoker context</summary>

---

Note that when injecting data into only the relevant invoker's interventions, `nnsight` tries to narrow the data down to the right batch indices, but it can't guarantee this (for example, when the input or output is an arbitrary object). So there are cases where all invokes will get all of the data.

Just like `.trace(...)` created a `Tracer` object, `.invoke(...)` creates an `Invoker` object. One thing the `Invoker` object has is the post-processed inputs at `invoker.inputs`, which can be useful for seeing information about your input. If you're using `.trace(...)` with inputs, you can still access the invoker object at `tracer._invoker`.

Keyword arguments given to `.invoke(...)` make their way to the input pre-processing. For example, in `LanguageModel`, the keyword arguments are used for tokenization, like `max_length` and `truncation`. If you need to pass keyword arguments directly to a one-input `.trace(...)`, you can pass an `invoker_args` keyword argument, which should be a dictionary of keyword arguments for the invoker: `.trace(..., invoker_args={...})`.

---

</details>

In [20]:
with model.trace() as tracer:

  with tracer.invoke('The Eiffel Tower is in the city of'):

    # Ablate the last MLP for only this batch.
    model.transformer.h[-1].mlp.output[0][:] = 0

    # Get the output for only the intervened on batch.
    token_ids_intervention = model.lm_head.output.argmax(dim=-1).save()

  with tracer.invoke('The Eiffel Tower is in the city of'):

    # Get the output for only the original batch.
    token_ids_original = model.lm_head.output.argmax(dim=-1).save()

print("Original token IDs:", token_ids_original)
print("Intervention token IDs:", token_ids_intervention)

print("Original prediction:", model.tokenizer.decode(token_ids_original[0][-1]))
print("Intervention prediction:", model.tokenizer.decode(token_ids_intervention[0][-1]))

Original token IDs: tensor([[ 198,   12,  417, 8765,  318,  257,  262, 3504, 7372, 6342]],
       device='mps:0')
Intervention token IDs: tensor([[ 262,   12,  417, 8765,   11,  257,  262, 3504,  338, 3576]],
       device='mps:0')
Original prediction:  Paris
Intervention prediction:  London


So it did actually end up affecting what the model predicted. That's pretty neat.
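
The batching idea, where several invokes contribute inputs that are processed in a single forward pass and each invoke only sees its own row, can be sketched with a toy tracer. `ToyBatchTracer` and `toy_model` are hypothetical stand-ins for illustration, not `nnsight` classes.

```python
class ToyBatchTracer:
    """Hypothetical sketch: gathers inputs from several invokes, runs once."""

    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.batch = []            # one entry per invoke
        self.forward_passes = 0

    def invoke(self, item):
        # Record the input and return the batch row belonging to this invoke.
        self.batch.append(item)
        return len(self.batch) - 1

    def run(self):
        # A single call handles every invoke's input at once.
        self.forward_passes += 1
        return self.model_fn(self.batch)


def toy_model(batch):
    # Stand-in for a forward pass: one output (a word count) per batch row.
    return [len(text.split()) for text in batch]


tracer = ToyBatchTracer(toy_model)
row_a = tracer.invoke("The Eiffel Tower is in the city of")
row_b = tracer.invoke("Hello world")
outputs = tracer.run()

# Each invoke reads back only the row that corresponds to its own input.
output_a = outputs[row_a]
output_b = outputs[row_b]
```

Both inputs are handled by one call to the toy model, which is the saving that batching invokes provides.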

Another cool thing with multiple invokes is that the Proxies can interact between them! Here we transfer the word token embeddings from a real prompt into one of all blanks. Therefore the blank prompt produces the output of the real prompt:

In [21]:
with model.trace() as tracer:

  with tracer.invoke("The Eiffel Tower is in the city of"):

    embeddings = model.transformer.wte.output

  with tracer.invoke("_ _ _ _ _ _ _ _ _ _"):

    model.transformer.wte.output = embeddings

    token_ids_intervention = model.lm_head.output.argmax(dim=-1).save()

  with tracer.invoke("_ _ _ _ _ _ _ _ _ _"):

    token_ids_original = model.lm_head.output.argmax(dim=-1).save()

print("Original prediction:", model.tokenizer.decode(token_ids_original[0][-1]))
print("Intervention prediction:", model.tokenizer.decode(token_ids_intervention[0][-1]))

Original prediction:  _
Intervention prediction:  Paris


## .next()

Some Hugging Face models define methods to generate multiple outputs at a time. `LanguageModel` wraps that functionality to provide the same tracing features just by using `.generate(...)` instead of `.trace(...)`. This calls the underlying model's `.generate` method and passes the output through a `model.generator` module that we've added onto the model, allowing you to get the generate output at `model.generator.output`.

In a case like this, the underlying model is called more than once, meaning the modules of said model produce more than one output. So which iteration should a given `module.output` refer to? That's where `Module.next()` comes in.

Each module has a call index associated with it, and `.next()` simply increments that attribute. Come execution time, data is injected into the intervention graph only at the call index defined.

In [22]:
with model.generate("The Eiffel Tower is in the city of", max_new_tokens=3):

  token_ids_1 = model.lm_head.output.argmax(dim=-1).save()

  token_ids_2 = model.lm_head.next().output.argmax(dim=-1).save()

  token_ids_3 = model.lm_head.next().output.argmax(dim=-1).save()

  output = model.generator.output.save()

print("Prediction 1: ", model.tokenizer.decode(token_ids_1[0][-1]))
print("Prediction 2: ", model.tokenizer.decode(token_ids_2[0][-1]))
print("Prediction 3: ", model.tokenizer.decode(token_ids_3[0][-1]))

print("All token ids: ", output)

print("All prediction: ", model.tokenizer.batch_decode(output))



Prediction 1:   Paris
Prediction 2:  ,
Prediction 3:   and
All token ids:  tensor([[ 464,  412,  733,  417, 8765,  318,  287,  262, 1748,  286, 6342,   11,
          290]], device='mps:0')
All prediction:  ['The Eiffel Tower is in the city of Paris, and']
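
The call-index bookkeeping behind `.next()` can be sketched with a toy module that is called several times and captures its output only on the calls that were requested. `ToyModule` and `request_output` are hypothetical names for illustration; this is not how `nnsight` is implemented.

```python
class ToyModule:
    """Hypothetical sketch of per-call bookkeeping for a repeatedly called module."""

    def __init__(self):
        self.call_idx = 0       # which call new requests refer to
        self.requests = {}      # call index -> names of requested outputs
        self.captured = {}      # name -> value captured at execution time
        self._calls_seen = 0    # how many times the module has actually run

    def next(self):
        # Point subsequent requests at the following call of this module.
        self.call_idx += 1
        return self

    def request_output(self, name):
        self.requests.setdefault(self.call_idx, []).append(name)

    def __call__(self, x):
        out = x * 2
        # Inject the output only into requests made for this call index.
        for name in self.requests.get(self._calls_seen, []):
            self.captured[name] = out
        self._calls_seen += 1
        return out


module = ToyModule()
module.request_output("first")
module.next().request_output("second")
module.next().request_output("third")

# The module is called three times, like a model during generation.
value = 1
for _ in range(3):
    value = module(value)
```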


# I thought you said huge models?

`NNsight` is only one half of our project to democratize access to AI internals. The other half is `NDIF` (National Deep Inference Facility).

The interaction between the two is fairly straightforward. The `intervention graph` we create via the tracing context can be encoded into a custom json format and sent via an http request to the `NDIF` servers. `NDIF` then decodes the `intervention graph` and `interleaves` it alongside the specified model. That's it!
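
To give a feel for what "encoding a graph as JSON" can look like, here is a minimal sketch that serializes a small list of dependent operations and decodes it again. The node layout (`id`, `op`, `module`, `args`) is entirely hypothetical and is not the actual format used by `nnsight` or `NDIF`.

```python
import json

# Hypothetical node layout: each node names an operation and the ids of the
# nodes it depends on. This is NOT the real nnsight/NDIF wire format.
nodes = [
    {"id": 0, "op": "get_output", "module": "layer1", "args": []},
    {"id": 1, "op": "sum", "module": None, "args": [0]},
    {"id": 2, "op": "save", "module": None, "args": [1]},
]

# Encode the graph as a JSON string, as one would before sending a request.
payload = json.dumps({"nodes": nodes})

# The receiving side decodes it back into the same structure.
decoded = json.loads(payload)

# Every dependency must refer to a node defined earlier in the list.
dependencies_ok = all(
    dep < node["id"] for node in decoded["nodes"] for dep in node["args"]
)
```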

To see which models are currently being hosted, check out this status page: https://nnsight.net/status/

## Remote execution

In its current state, `NDIF` requires you to receive an API key. Therefore, to run the rest of this Colab you need one of your own. To get one, simply join the [NDIF discord](https://discord.gg/6uFJmCSwW7) and introduce yourself on the `#introductions` channel. Then DM either @JadenFK or @caden and we'll create one for you.

Once you have one, to register your API key with `nnsight`, do the following:

In [None]:
from nnsight import CONFIG

CONFIG.set_default_api_key("<your api key here>")

This only needs to be run once, as it will save this API key as the default in a config file along with the `nnsight` installation.

To amp things up a few levels, let's demonstrate using `nnsight`'s tracing context with one of the larger open source language models, Llama 2 70B!

In [23]:
# We'll never actually load the parameters so no need to specify a device_map.
model = LanguageModel("meta-llama/Llama-2-70b-hf")

# All we need to specify using NDIF vs executing locally is remote=True.
with model.trace('The Eiffel Tower is in the city of', remote=True) as runner:

    hidden_states = model.model.layers[-1].output.save()

    output = model.output.save()

print(hidden_states)

print(output['logits'])

65cebeafa7c4e61fa5de838a - RECEIVED: Your job has been received and is waiting approval.
65cebeafa7c4e61fa5de838a - APPROVED: Your job was approved and is waiting to be run.
65cebeafa7c4e61fa5de838a - COMPLETED: Your job has been completed.


Downloading result: 100%|██████████| 9.03M/9.03M [00:00<00:00, 10.6MB/s]

(tensor([[[ -0.1491,  -5.8323,   4.3558,  ...,   3.4249,  23.8446, -13.8308],
         [  9.9500,  -1.2191,   1.3571,  ...,   0.2700,  31.3980,  -8.7688],
         [ -1.6217,   9.1563,  -1.9623,  ...,   4.9550,  14.7141,  27.5813],
         ...,
         [ -2.7155,   5.6727,   0.7352,  ...,   5.7620,  26.6322,  -1.6128],
         [  2.9816,   0.3416,  -0.7144,  ...,   2.7997,  32.5098,   4.9818],
         [  0.2935,   1.6404,   6.5683,  ...,   0.1217,  53.6284,  -0.1982]]]), <transformers.cache_utils.DynamicCache object at 0x1680ca670>)
tensor([[[ -9.5869,  -3.8338,   1.8843,  ...,  -4.9517,  -4.6490,  -4.3777],
         [-10.0180, -10.1247,  -2.0404,  ...,  -6.5920,  -7.4882,  -5.7967],
         [ -5.3571, -10.4557,   0.8910,  ...,  -3.3766,  -2.3204,  -1.7122],
         ...,
         [-11.2018, -11.9971,  -0.2768,  ...,  -7.3000,  -6.6654,  -5.2826],
         [ -6.1399,  -4.2103,   7.6792,  ...,  -3.6305,  -3.2480,  -2.1176],
         [-12.5268, -12.0461,   5.5539,  ...,  -5.2163,  -




It really is as simple as `remote=True`. All of the techniques we went through in earlier sections work just the same when running locally and remotely.

Note that both `nnsight` and especially `NDIF` are in active development, and therefore there may be caveats, changes, and errors to work through.

# Getting Involved!

If you're interested in following updates to `nnsight`, contributing, giving feedback, or finding collaborators, please join the [NDIF discord](https://discord.gg/6uFJmCSwW7)!

The [Mech Interp discord](https://discord.gg/km2RQBzaUn) is also a fantastic place to discuss all things mech interp with a really cool community.

Our website, [nnsight.net](https://nnsight.net/), has a bunch more tutorials detailing more complex interpretability techniques using `nnsight`. If you want to share any of the work you do using `nnsight`, let others know on either of the discords above and we might turn it into a tutorial on our website.

💟



