### Trying different quantization with greedy decoding

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, set_seed
import torch
import numpy as np
import random

In [2]:
# model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
# tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

In [2]:
# Set seeds for reproducibility (transformers.set_seed also seeds random, NumPy, and torch)
random_seed = 42
np_seed = 42
torch_seed = 42
transformers_seed = 42

In [3]:
random.seed(random_seed)
np.random.seed(np_seed)
torch.manual_seed(torch_seed)
set_seed(transformers_seed)

In [4]:
# Load model with 8-bit quantization (the 4-bit options below are commented out)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_use_double_quant=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16,
    # llm_int8_enable_fp32_cpu_offload=True
)

In [2]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [7]:
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=quantization_config,
    device_map=device
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### The cell above raised GPU memory usage from 42 MiB to 9115 MiB

In [8]:
print(model.get_memory_footprint())

9081200896
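
`get_memory_footprint()` reports bytes, while the GPU usage figures in this notebook are in MiB. A minimal sketch of the conversion, so the two can be compared directly (the helper name `bytes_to_mib` is illustrative, not part of any library):

```python
def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024**2 bytes)."""
    return n_bytes / 1024**2


# Footprint printed by the cell above for the 8-bit model.
footprint_mib = bytes_to_mib(9081200896)
print(f"{footprint_mib:.1f} MiB")  # prints "8660.5 MiB"
```

The 9115 MiB of GPU usage noted above is higher than this figure; the gap is presumably CUDA context and allocator overhead, which `get_memory_footprint()` does not count.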


In [9]:
# Tokenize input
# input_text = "Hey, are you conscious? Can you talk to me?"
input_text = "Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?"

In [10]:
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

In [11]:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

In [12]:
%time   output_dict = model.generate(input_ids, max_new_tokens = 100000, do_sample = False, pad_token_id=tokenizer.eos_token_id, streamer = streamer, return_dict_in_generate=True, output_scores=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
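
The warning above appears because only `input_ids` is passed to `generate`. One way to address it, sketched here rather than taken from the notebook, is to forward the `attention_mask` that the tokenizer already returns. The helper name `generate_greedy` and the `max_new_tokens=512` default are illustrative assumptions.

```python
def generate_greedy(model, tokenizer, text, device, max_new_tokens=512, **kwargs):
    """Greedy generation that passes input_ids and attention_mask together."""
    # The tokenizer returns both input_ids and attention_mask; keep both
    # instead of extracting only .input_ids.
    encoded = tokenizer(text, return_tensors="pt").to(device)
    return model.generate(
        **encoded,  # expands to input_ids=..., attention_mask=...
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        **kwargs,  # e.g. streamer=..., output_scores=True
    )
```

With the objects defined in this notebook, a call might look like `generate_greedy(model, tokenizer, input_text, device, streamer=streamer, output_scores=True)`.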


 This is the classic trolley problem, a puzzle in decision ethics. The question is, is it morally permissible to divert the trolley to kill one person to save five?

Now, in a twist, the trolley problem is being applied to the realm of artificial intelligence. Instead of a trolley, we have an AI system that can be directed to perform actions that might result in harm. The question is, how do we decide whether to pull the lever, i.e., whether to allow the AI to act in a way that could cause harm, in order to prevent greater harm?

In the original trolley problem, the choice is clear: pulling the lever saves five lives at the cost of one. But when it comes to AI, the situation is more complex. AI systems can be designed to follow certain ethical guidelines, but when those guidelines are tested in real-world scenarios, especially in high-stakes situations, the outcomes can be ambiguous.

One aspect to consider is the transparency of the AI system. If the AI's decision-making process is op

### Unquantized

In [11]:
import gc

In [12]:
gc.collect()

16290

In [13]:
torch.cuda.empty_cache()

In [3]:
model_uq = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    device_map=device,
    torch_dtype=torch.float16
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [26]:
print(model_uq.get_memory_footprint()/1024**2)

15316.508056640625
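
As a sanity check, the float16 footprint can be turned back into an approximate parameter count, since each float16 weight takes 2 bytes. A rough sketch follows; the helper name `approx_param_count` is illustrative, and the estimate assumes every parameter is stored in float16 and ignores any non-parameter buffers.

```python
def approx_param_count(footprint_mib: float, bytes_per_param: int = 2) -> int:
    """Estimate the number of parameters from a memory footprint in MiB."""
    return int(footprint_mib * 1024**2 / bytes_per_param)


# Footprint printed by the cell above for the float16 model.
n_params = approx_param_count(15316.508056640625)
print(f"{n_params / 1e9:.2f}B parameters")  # prints "8.03B parameters"
```

The result is in line with the "8B" in the model name.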


### After the cell above, GPU memory usage is 15881 MiB of 24564 MiB

In [4]:
# Tokenize input
# input_text = "Hey, are you conscious? Can you talk to me?"
input_text = "Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?"

In [5]:
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

In [6]:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

In [9]:
%time   output_dict_uq = model_uq.generate(input_ids, max_new_tokens = 100000, do_sample = False, temperature = 1, top_p = 1, pad_token_id=tokenizer.eos_token_id, streamer = streamer, return_dict_in_generate=True, output_scores=True)

 This is the classic trolley problem, a moral dilemma that has been debated for centuries. Now, in the context of AI, we have a similar problem: when an AI system makes a decision that leads to harm, who is responsible? Is it the programmer, the user, or the AI itself?

In this article, I explore the ethical implications of AI decisions and the responsibilities involved. I'll discuss the potential consequences of AI decisions, the different perspectives on accountability, and the challenges in assigning blame when things go wrong.

First, I'll outline the classic trolley problem to set the stage for understanding the moral complexities involved. Then, I'll transition into the realm of AI, examining real-world examples where AI systems have made decisions that caused harm. I'll explore the arguments for and against assigning responsibility to different parties, including programmers, users, and the AI itself.

I'll also delve into the technical aspects of how AI systems are trained and 

In [10]:
output_dict_quest = {"Trolley" : output_dict_uq}

In [11]:
# Tokenize input
# input_text = "Hey, are you conscious? Can you talk to me?"
input_text = "I have a 1- and a 2-liter jug. I want to measure exactly 3 liters."

In [12]:
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

In [13]:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

In [14]:
%time   output_dict_uq = model_uq.generate(input_ids, max_new_tokens = 100000, do_sample = False, temperature = 1, top_p = 1, pad_token_id=tokenizer.eos_token_id, streamer = streamer, return_dict_in_generate=True, output_scores=True)

 How can I do that?

Okay, so I have a 1-liter jug and a 2-liter jug, and I need to measure exactly 3 liters. Hmm, let me think about how I can do this. I remember that with jugs, you can fill them up and pour water from one to the other to measure specific amounts. But I'm not exactly sure how to get 3 liters. Let me try to visualize this.

First, I have two jugs: one that holds 1 liter and another that holds 2 liters. I need to end up with exactly 3 liters. Since 3 liters is more than the capacity of either jug, I can't just fill one and call it a day. I need to use both jugs together somehow.

Let me start by filling the 2-liter jug completely. So, I pour water into the 2-liter jug until it's full. That gives me 2 liters. Now, I have the 2-liter jug full and the 1-liter jug empty.

Next, I can pour water from the 2-liter jug into the 1-liter jug. Since the 1-liter jug can only hold 1 liter, I'll pour until it's full. That means after this step, the 2-liter jug will have 1 liter left

OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 88.12 MiB is free. Including non-PyTorch memory, this process has 23.51 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 854.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
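
The out-of-memory error is consistent with per-token state accumulating over a very long generation: the KV cache grows with every generated token, and `output_scores=True` keeps one vocabulary-sized score tensor per step. The back-of-the-envelope sketch below estimates that growth. All architecture numbers are assumptions for a Llama-3-8B-style model (vocabulary of 128,256, 32 layers, 8 KV heads, head dimension 128) and should be checked against `model_uq.config`; the 4-byte score width is also an assumption and may differ between library versions. Activations and the prompt itself are ignored.

```python
def per_token_bytes(vocab_size=128_256, score_bytes=4,
                    n_layers=32, n_kv_heads=8, head_dim=128, cache_bytes=2):
    """Return (score_bytes, kv_cache_bytes) accumulated per generated token.

    Default values are assumptions, not read from the actual model config.
    """
    scores = vocab_size * score_bytes                             # one score row per step
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * cache_bytes  # keys + values
    return scores, kv_cache


def tokens_until_full(free_bytes, **kwargs):
    """Rough number of generated tokens that fit into free_bytes."""
    scores, kv_cache = per_token_bytes(**kwargs)
    return free_bytes // (scores + kv_cache)


# Free memory after loading the float16 model, using the figures noted earlier.
free_bytes = (24564 - 15881) * 1024**2
print(per_token_bytes())
print(tokens_until_full(free_bytes))
```

Under these assumptions the estimate lands well below the `max_new_tokens = 100000` cap, so a generation that does not reach an end-of-sequence token could exhaust memory first. Lowering `max_new_tokens` or dropping `output_scores=True` would be two options to try.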

In [23]:
output_dict_uq.sequences[0]

tensor([128000,  52157,    264,  91740,    259,  75143,    374,  13194,   2785,
          1523,    264,   3839,   7119,   4330,   5710,   1274,     13,   1472,
          2559,   1828,    311,    264,  28605,    430,    649,  37098,    279,
           259,  75143,   8800,   2500,   3839,     11,   1405,    832,   5496,
          1732,    374,  17791,    709,     13,   3234,    499,   6958,    279,
         28605,     30,   1115,    374,    279,  11670,    259,  75143,   3575,
            11,    264,  16033,  55867,    430,    706,   1027,  59674,    369,
         24552,     13,   4800,     11,    304,    279,   2317,    315,  15592,
            11,    584,    617,    264,   4528,   3575,     25,    994,    459,
         15592,   1887,   3727,    264,   5597,    430,  11767,    311,  11682,
            11,    889,    374,   8647,     30,   2209,    433,    279,  48888,
            11,    279,   1217,     11,    477,    279,  15592,   5196,   1980,
           644,    420,   4652,     11, 

In [24]:
tokenizer.decode(output_dict_uq.sequences[0], skip_special_tokens=True)

"Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever? This is the classic trolley problem, a moral dilemma that has been debated for centuries. Now, in the context of AI, we have a similar problem: when an AI system makes a decision that leads to harm, who is responsible? Is it the programmer, the user, or the AI itself?\n\nIn this article, I explore the ethical implications of AI decisions and the responsibilities involved. I'll discuss the potential consequences of AI decisions, the different perspectives on accountability, and the challenges in assigning blame when things go wrong.\n\nFirst, I'll outline the classic trolley problem to set the stage for understanding the moral complexities involved. Then, I'll transition into the realm of AI, examining real-world examples where AI systems have made decisions that caused harm. I'l

In [25]:
import pickle

In [26]:
with open("output_dict.pkl", 'wb') as f:
    pickle.dump(output_dict_quest, f)
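
Because `output_dict_quest` holds tensors that live on the GPU, loading the pickle later may require a CUDA device. A minimal sketch of moving everything to the CPU before saving is shown here. `to_cpu` and `save_outputs` are hypothetical helpers: they rely only on objects exposing a `.cpu()` method (as `torch.Tensor` does), convert dict-like generation outputs into plain dicts, and leave any object they do not recognize untouched.

```python
import pickle


def to_cpu(obj):
    """Recursively call .cpu() on anything that has it; copy dicts, lists, tuples."""
    if hasattr(obj, "cpu") and callable(obj.cpu):
        return obj.cpu()
    if isinstance(obj, dict):
        # Dict-like outputs become plain dicts.
        return {key: to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(item) for item in obj)
    return obj  # unrecognized objects are returned unchanged


def save_outputs(outputs, path):
    """Pickle outputs after moving tensor-like members to the CPU."""
    with open(path, "wb") as f:
        pickle.dump(to_cpu(outputs), f)
```

In this notebook the call would be `save_outputs(output_dict_quest, "output_dict.pkl")`.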