In [1]:
!pip install transformers accelerate peft bitsandbytes datasets torch


Collecting bitsandbytes
  Downloading bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_6

In [2]:
from google.colab import drive
import os
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
txt_path = input("Enter the full path where the transcription should be saved (e.g., transcription.txt): ")
os.environ["TEXT_OUTPUT"] = txt_path
text_file_path = os.getenv("TEXT_OUTPUT")  # Transcription file

In [16]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Use 4-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",  # Normal Float-4 for better efficiency
    device_map="auto"
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)


Unused kwargs: ['device_map']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


In [17]:
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Define a prompt
prompt = "What do you know about cryptography?"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
output = model.generate(**inputs, max_new_tokens=1000)

# Decode and print response
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


What do you know about cryptography? What is the difference between symmetric and asymmetric cryptography?
I need to provide a detailed answer.
I need to provide a detailed answer.
I need to provide a detailed answer.
Okay, so I need to explain what cryptography is, and then talk about the differences between symmetric and asymmetric cryptography. I'm a bit new to this, so I'll have to make sure I understand each part properly.

First, what is cryptography? I think it's about secure communication methods. Maybe it's like encoding messages so that only the intended person can read them. I remember hearing about encryption, which is a method to do that. So, cryptography is the practice for protecting information. It involves creating algorithms and protocols to keep data secret. I think it's used in many areas like online transactions, secure messaging, and even in protecting data at rest.

Now, symmetric cryptography. From what I recall, this uses the same key for both encryption and de

In [19]:
from datasets import Dataset

# Read the text file
with open(text_file_path, "r", encoding="utf-8") as file:
    raw_text = file.read()

# Step 2: Preprocess the Data
# Break long text into chunks (split by sentences or fixed-size)
chunk_size = 512  # Adjust as needed
text_chunks = [raw_text[i:i+chunk_size] for i in range(0, len(raw_text), chunk_size)]

# Convert into structured dataset format
data = [{"text": chunk} for chunk in text_chunks]

# Convert to Hugging Face dataset
dataset = Dataset.from_list(data)

def tokenize_function(examples):
    tokenized_inputs = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

    # Shift labels left for next-letter prediction
    input_ids = tokenized_inputs["input_ids"]
    labels = input_ids.copy()

    for i in range(len(labels)):  # Shift all labels left by 1 position
        labels[i] = labels[i][1:] + [tokenizer.pad_token_id]  # Append PAD token at the end

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


# Step 4: Apply Tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Print a sample
print("Sample Processed Data:")
print(dataset[0])  # Show first entry


Map:   0%|          | 0/58 [00:00<?, ? examples/s]

Sample Processed Data:
{'text': " So we can kind of put that under the foot, that under the rug. And then one question came up about proof by cases. Well, cases are kind of the different scenarios that can happen. And if you can prove there's something happens in every possible scenario, that something is going to always be true. So if I can prove something happens if it's raining, and if it's not raining, then it happens all the time. And even in all, there was one example of two cases. You can consider for a new teacher, because those ar"}


In [21]:
from peft import LoraConfig, get_peft_model

# Define LoRA configuration
lora_config = LoraConfig(
    r=8,  # Rank size (small number to save memory)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Target specific layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 1,089,536 || all params: 1,778,177,536 || trainable%: 0.0613


In [22]:
from transformers import TrainingArguments, Trainer

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=20,
    save_strategy="epoch",
    logging_dir="./logs",
    evaluation_strategy="epoch",
    report_to="none"
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
)

# Start training
trainer.train()




Epoch,Training Loss,Validation Loss
1,No log,7.357521
2,No log,5.279541
3,No log,3.155069
4,No log,2.592206
5,No log,2.509021
6,No log,2.455338
7,No log,2.39235
8,No log,2.323343
9,No log,2.252383
10,No log,2.184067


TrainOutput(global_step=140, training_loss=2.9194774082728796, metrics={'train_runtime': 418.685, 'train_samples_per_second': 2.771, 'train_steps_per_second': 0.334, 'total_flos': 4831058869616640.0, 'train_loss': 2.9194774082728796, 'epoch': 17.551724137931036})

In [23]:
# Define a prompt
prompt = "What do you know about cryptography?"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response
output = model.generate(**inputs, max_new_tokens=1000)

# Decode and print response
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


What do you know about cryptography? What is the ...
What is the difference between symmetric and asymmetric cryptography? What is the difference between key sizes? What is the difference between encryption and decryption?

What is the difference between symmetric and asymmetric cryptography? What is the difference between key sizes? What is the difference between encryption and decryption?
What is the difference between symmetric and asymmetric cryptography? What is the difference between key sizes? What is the difference between encryption and decryption?
What is the difference between symmetric and asymmetric cryptography? What is the difference between key sizes? What is the difference between encryption and decryption?
What is the difference between symmetric and asymmetric cryptography? What is the difference between key sizes? What is the difference between encryption and decryption?
What is the difference between symmetric and asymmetric cryptography? What is the difference bet

In [10]:
# model.save_pretrained("./fine_tuned_deepseek")
# tokenizer.save_pretrained("./fine_tuned_deepseek")


('./fine_tuned_deepseek/tokenizer_config.json',
 './fine_tuned_deepseek/special_tokens_map.json',
 './fine_tuned_deepseek/tokenizer.json')

In [11]:
# from transformers import AutoModelForCausalLM, AutoTokenizer

# model = AutoModelForCausalLM.from_pretrained("./fine_tuned_deepseek")
# tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_deepseek")
