# BEFORE YOU START, CHANGE THE RUNTIME TO REQUEST A T4 GPU

In [1]:
# Installs
# You will need at a minimum the following packages. Feel free to install
# additional ones as needed
!pip install google-generativeai
!pip install datasets
!pip install -U bitsandbytes
!pip install transformers
!pip install -U peft
!pip install -U "huggingface_hub[cli]"
!pip install -U trl

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

In [None]:
import google.generativeai as genai
from datasets import Dataset, DatasetDict
import pandas as pd
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, \
    BitsAndBytesConfig, TrainingArguments, pipeline, logging
import torch
from trl import SFTTrainer

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# In this block, I include code for parsing the Intro chapter from a text file.
# I ran the following two lines on a Linux terminal. You can find the
# equivalent for your OS or find an online tool for converting PDF into text.
# >>> pdftotext -nopgbrk MIT_6390_chapter_Introduction.pdf
# >>> sed -r ':a /[a-zA-Z,\ ]$/N;s/(.)\n/\1 /;ta' \
#        MIT_6390_chapter_Introduction.txt > \
#        MIT_6390_chapter_Introduction_reformat.txt
#
# Once the PDF was converted to a text file, I manually looked through it to:
# - Remove uninformative lines (e.g., "Last updated: ...", "MIT 6390", ...)
# - Remove comments, which come somewhat poorly organized
# - Remove double line breaks
# - Fix up equations a bit so that they made sense in text format
#
# I did all this in a simple text editor (I used Sublime Text). Then I simply
# uploaded the file to Colab and ran the following code to split the text into
# informative paragraphs. This required a bit of iterating back and forth to
# make sure that no paragraph was "trailing" from the previous one.

import glob

for file in glob.glob("/content/drive/MyDrive/MIT_6390_chapter_Introduction_reformat.txt"):
  with open(file) as f:
    lines = f.readlines()

  paragraphs = []
  min_chars = 200
  par = lines[0]
  for ln in lines[1:]:
    if ln[0] == "•" or (ln[0].isdigit() and ln[1:3] == ". "):
      # Part of a list, combine with previous items
      par += ln
    else:
      paragraphs.append(par.strip())  # Remove trailing whitespace and store
      par = ln               # Start new paragraph

  paragraphs.append(par.strip())  # Store the final paragraph

for line in paragraphs:
  print(line)
  print('---')
print(len(paragraphs))

# I have uploaded the reformatted text file I used to generate data for the
# Intro chapter. You should follow a similar process to create data from any
# source you wish to use. Note: the better you clean up your data, the more
# useful your final model will be.

Introduction  The main focus of machine learning (ML) is making decisions or predictions based on data.
---
There are a number of other fields with significant overlap in technique, but difference in focus: in economics and psychology, the goal is to discover underlying causal processes and in statistics it is to find a model that fits a data set well. In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions or decisions.
---
As ML methods have improved in their capability and scope, ML has become arguably the best way–measured in terms of speed, human engineering time, and robustness–to approach many applications. Great examples are face detection, speech recognition, and many kinds of language-processing tasks. Almost any application that involves understanding data or signals that come from the real world can be nicely addressed using machine learning.
---
One crucial aspect of machine learning approa
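
The splitting rule used above (bullets and numbered items are attached to the preceding paragraph; every other line starts a new one) can be isolated into a small function and tried on a few lines before running it on a whole chapter. This is a minimal sketch; the function name `split_paragraphs` and the sample lines are illustrative and not part of the original notebook.

```python
def split_paragraphs(lines):
    """Group lines into paragraphs, attaching list items to the preceding paragraph."""
    paragraphs = []
    par = None
    for ln in lines:
        is_bullet = ln.startswith("•")
        is_numbered = ln[:1].isdigit() and ln[1:3] == ". "
        if par is not None and (is_bullet or is_numbered):
            par += ln  # list item: keep it with the paragraph that introduces it
        else:
            if par is not None:
                paragraphs.append(par.strip())
            par = ln  # any other line starts a new paragraph
    if par is not None:
        paragraphs.append(par.strip())
    return paragraphs


sample = [
    "Five problem classes are described below.\n",
    "1. Supervised learning\n",
    "2. Unsupervised learning\n",
    "1.1 Problem class  A section heading is not a list item.\n",
]
result = split_paragraphs(sample)
print(len(result))
for paragraph in result:
    print(paragraph)
    print("---")
```

A heading such as "1.1 Problem class" starts with a digit but is not followed by ". ", so it opens a new paragraph instead of being merged into the list.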

In [None]:
print(paragraphs[8])

1.1 Problem class  There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. Five standard problem classes are described below, to establish some notation and terminology.


In [None]:
# This block contains the code to interact with the Google Gemini 1.5 Flash API
# to request questions and answers for your data. It loops through a list of
# paragraphs and requests Gemini to create one question for each paragraph
# individually. Your main job here is to write appropriate prompts that lead Gemini
# to generate useful questions. You can also consider generating more/fewer
# questions per paragraph, or merging paragraphs if you think that will help.
# Be sure to document any changes you make in your report!

# TODO: create your own Gemini API key, and either paste it here (and then
# remove it before turning in your report) or save it in a file and load it here.
geminiApiKey = "API_KEY"  # placeholder: never leave a real key in a shared notebook
genai.configure(api_key=geminiApiKey)
cfg = genai.types.GenerationConfig(max_output_tokens=4000)
sys_msg_train = (
'''

   Please generate a concise question for each paragraph, given that the paragraphs are separated by the text ESE 577: DLAS, followed by a clear corresponding answer for each question, such that the questions and answers can be used as training data to train an LLM on deep learning.

'''
)
print(sys_msg_train)
print()
model_train = genai.GenerativeModel('gemini-1.5-flash', system_instruction=sys_msg_train)
qa_pairs_train = []
for par in paragraphs[:5]:
  qa_pairs_train.append(model_train.generate_content(par, generation_config=cfg).text)
  print(qa_pairs_train[-1])



   Please generate a concise question for each paragraph, given that the paragraphs are separated by the text ESE 577: DLAS, followed by a clear corresponding answer for each question, such that the questions and answers can be used as training data to train an LLM on deep learning.



**Question:** What is the primary goal of machine learning?

**Answer:** Making decisions or predictions based on data.

**Question:** What is the primary difference between the application of modeling in machine learning versus fields like economics, psychology, and statistics?

**Answer:** While economics, psychology, and statistics aim to create models to understand underlying causal processes or achieve a good data fit, machine learning uses model fitting primarily as a tool for improving prediction and decision-making.



ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 380.71ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 810.85ms


**Question:** What are some advantages of using Machine Learning (ML) for applications like face detection, speech recognition, and language processing, compared to other approaches?

**Answer:** ML offers superior speed, reduced human engineering time, and increased robustness compared to other methods when tackling applications involving real-world data and signals, such as face detection, speech recognition, and language processing.



ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 3519.94ms


**Question:** What key role does human engineering play in machine learning problem-solving, encompassing the various stages involved?

**Answer:** Human engineering is crucial in framing the problem, acquiring and organizing data, designing the solution space, selecting algorithms and parameters, applying the algorithm, validating solutions, and assessing the societal impact of deployment.  These steps are essential for successful machine learning application.

**Question:** What fundamental philosophical problem underlies the concept of learning from data, and how do we address it in practice?

**Answer:** The fundamental problem is induction – the assumption that past data predicts the future. We address it by making operational assumptions, such as the data being independently and identically distributed (i.i.d.) and the testing data coming from the same distribution as the training data, or that the answer lies within a predefined set of possibilities.
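
The 503 lines interleaved with the output indicate that some requests hit a temporary server-side error. If a request ever fails outright, one option is to wrap the call in a retry loop with exponential backoff. The helper below is a generic sketch: `call_with_retries`, its parameters, and the `flaky` stand-in are hypothetical, and it assumes the failing call raises an exception, which should be checked against the client library in use.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "ok"


delays = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the generation loop, the call could then be written as `call_with_retries(lambda: model_train.generate_content(par, generation_config=cfg).text)`.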



In [None]:
# Import required libraries
import os
import json
import google.generativeai as genai
from google.colab import files
from IPython.display import display, HTML

# Configure the Gemini API
geminiApiKey = "API_KEY"
genai.configure(api_key=geminiApiKey)
model = genai.GenerativeModel('gemini-pro')
cfg = genai.types.GenerationConfig(max_output_tokens=4000)

# Function to upload and read a file
def upload_and_read_file():
    uploaded = files.upload()
    file_name = next(iter(uploaded))
    with open(file_name, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

# Function to generate Q&A pairs
def generate_qa(content):
    prompt = f"""
    Generate question-and-answer pairs from the following text:

    {content.strip()}

    Format the output as:
    Q: [Question]
    A: [Answer]

    Separate each Q&A pair with a blank line.
    """

    response = model.generate_content(prompt, generation_config=cfg)
    q_and_a_pairs = response.text.split("\n\n")  # Assuming pairs are separated by double newlines

    qa_list = []

    for pair in q_and_a_pairs:
        if "Q:" in pair and "A:" in pair:
            question = pair.split("Q:")[1].split("A:")[0].strip()
            answer = pair.split("A:")[1].strip()
            qa_list.append({"question": question, "answer": answer})

    return qa_list

# Main execution
print("Please upload your input text file.")
file_content = upload_and_read_file()

qa_pairs = generate_qa(file_content)

# Save Q&A pairs to JSON file
json_filename = 'qa_pairs.json'
with open(json_filename, 'w', encoding='utf-8') as json_file:
    json.dump(qa_pairs, json_file, ensure_ascii=False, indent=2)

print(f"Questions and answers saved to {json_filename}")

# Display the first few Q&A pairs
display(HTML("<h3>Sample Q&A Pairs:</h3>"))
for pair in qa_pairs[:5]:  # Display first 5 pairs
    display(HTML(f"<b>Q:</b> {pair['question']}<br><b>A:</b> {pair['answer']}<br><br>"))

# Option to download the JSON file
files.download(json_filename)

Please upload your input text file.


Saving HW1QA.txt to HW1QA (5).txt
Questions and answers saved to qa_pairs.json



In [None]:
sys_msg_val = (

'''

  Please generate a concise question for each paragraph, followed by a clear corresponding answer for each question, such that the questions and answers can be used as validation data to validate a fine-tuned LLM on deep learning.

'''

)
print(sys_msg_val)
print()
model_val = genai.GenerativeModel('gemini-1.5-flash', system_instruction=sys_msg_val)
qa_pairs_val = []
for par in paragraphs[:5]:
  qa_pairs_val.append(model_val.generate_content(par, generation_config=cfg).text)
  print(qa_pairs_val[-1])



  Please generate a concise question for each paragraph, followed by a clear corresponding answer for each question, such that the questions and answers can be used as validation data to validate a fine-tuned LLM on deep learning.





ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 6500.47ms


**Question:** What is the primary goal of machine learning?

**Answer:** Making decisions or predictions based on data.



ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 382.58ms


**Question:** What is the primary difference between the application of modeling in machine learning versus fields like economics, psychology, and statistics?

**Answer:**  While economics, psychology, and statistics aim to create models that explain underlying causal processes or fit datasets, machine learning primarily uses models as tools to improve prediction and decision-making.  The focus is on the application of the model rather than the model itself.

**Question:** What are some advantages of using machine learning (ML) for applications involving real-world data and signals, and what are some example applications?

**Answer:**  ML offers advantages in speed, reduced human engineering time, and robustness compared to other methods.  Examples include face detection, speech recognition, and various language processing tasks.  It's well-suited for applications that involve understanding real-world data and signals.

**Question:** What significant human role is highlighted in the ap

In [None]:
# Create HuggingFace datasets
dataset_train = Dataset.from_pandas(pd.DataFrame(qa_pairs_train, columns=["text"]))
dataset_val = Dataset.from_pandas(pd.DataFrame(qa_pairs_val, columns=["text"]))
dataset = DatasetDict({"train": dataset_train, "test": dataset_val})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 5
    })
    test: Dataset({
        features: ['text'],
        num_rows: 5
    })
})
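
Both generation loops above draw from `paragraphs[:5]`, so the training and validation questions cover the same source text. If the validation set is meant to measure how well the model handles unseen material, one option is to partition the paragraphs first and generate each set from its own partition. A minimal sketch, where `split_train_val`, the 80/20 ratio, and the seed are illustrative assumptions:

```python
import random


def split_train_val(items, val_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into disjoint train/validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


paragraph_ids = list(range(10))  # stand-in for the real list of paragraphs
train_pars, val_pars = split_train_val(paragraph_ids)
print(len(train_pars), len(val_pars))
```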


In [None]:
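def format_data(example):
    # Split one "**Question:** ... **Answer:** ..." response into its two parts
    text = example["text"]
    question = text.split("**Question:**")[1].split("**Answer:**")[0].strip()
    answer = text.split("**Answer:**")[1].strip()
    return {"text": f"Question: {question}\n\nAnswer: {answer}"}

dataset["train"] = dataset["train"].map(format_data)
dataset["test"] = dataset["test"].map(format_data)  # the validation split is stored under "test"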

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

In [None]:
print(dataset["train"])

Dataset({
    features: ['text'],
    num_rows: 5
})
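
`format_data` relies on each response containing exactly one `**Question:**`/`**Answer:**` pair. If the prompt is changed to request several questions per paragraph, a response may hold multiple pairs, and a regular expression can extract all of them. This is a sketch under that assumption; `extract_qa_pairs` and the second sample pair are hypothetical.

```python
import re

# A question runs up to the next "**Answer:**"; an answer runs up to the next
# "**Question:**" or the end of the text.
QA_PATTERN = re.compile(
    r"\*\*Question:\*\*\s*(.*?)\s*\*\*Answer:\*\*\s*(.*?)\s*(?=\*\*Question:\*\*|\Z)",
    re.DOTALL,
)


def extract_qa_pairs(text):
    """Return a list of (question, answer) tuples found in one response."""
    return [(q.strip(), a.strip()) for q, a in QA_PATTERN.findall(text)]


response = (
    "**Question:** What is the primary goal of machine learning?\n\n"
    "**Answer:** Making decisions or predictions based on data.\n\n"
    "**Question:** What is induction?\n\n"
    "**Answer:** Assuming past data predicts the future.\n"
)
pairs = extract_qa_pairs(response)
for question, answer in pairs:
    print(f"Question: {question}\n\nAnswer: {answer}\n")
```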


In [None]:
from huggingface_hub import login

# Replace 'your_access_token' with the actual token you generated
login(token="your_access_token")

In [None]:
# Load the model -- Skeleton
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)
model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
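
The three shard sizes in the download log give a rough picture of why 4-bit loading matters on a 16 GB T4. The arithmetic below is a back-of-the-envelope estimate: it assumes the checkpoint stores 2 bytes per parameter and that NF4 costs about 0.5 bytes per parameter, and it ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
shard_gb = [4.94, 5.00, 4.54]            # sizes reported in the download log
checkpoint_gb = sum(shard_gb)            # total checkpoint size
params_billion = checkpoint_gb / 2       # assumption: 2 bytes per parameter (16-bit)
nf4_gb = params_billion * 0.5            # assumption: ~0.5 bytes per parameter in 4-bit

print(f"checkpoint: {checkpoint_gb:.2f} GB")
print(f"parameters: ~{params_billion:.2f} billion")
print(f"4-bit weights: ~{nf4_gb:.2f} GB")
```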

In [None]:
# Tokenize the data
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

('<s>', '</s>')

In [None]:
def tokenize_function(examples):
    # Pad or truncate every example to 512 tokens; plain lists are returned
    # so the result can be passed directly to Dataset.from_dict
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"]
    }

# Apply to both train and test datasets
tokenized_train = Dataset.from_dict(tokenize_function(dataset["train"]))
tokenized_eval = Dataset.from_dict(tokenize_function(dataset["test"]))

In [None]:
print(tokenized_train[0])

{'input_ids': [1, 22478, 28747, 1824, 349, 272, 6258, 5541, 302, 5599, 5168, 28804, 13, 13, 2820, 16981, 28747, 19387, 9549, 442, 20596, 2818, 356, 1178, 28723, 2, 2, 2, ...
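
The long run of `2`s in the output is padding: with `padding="max_length"`, every example is extended to 512 tokens using the pad token, which this notebook sets to the EOS token (id 2 in the output). The pure-Python sketch below mimics that behavior on a short sequence to show how the attention mask lines up with the padding; `pad_to_length` is an illustrative helper, not a tokenizer API.

```python
def pad_to_length(token_ids, max_length, pad_id):
    """Right-pad (or truncate) token_ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# The fourth token is a real EOS (id 2); the remaining 2s are padding.
example = pad_to_length([1, 22478, 28747, 2], max_length=8, pad_id=2)
print(example["input_ids"])
print(example["attention_mask"])
```

Because the pad and EOS tokens share an id, only the attention mask distinguishes the real end-of-sequence token from the padding that follows it.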

In [None]:
# LoRA config -- Skeleton
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
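
With `r=16` on the four attention projections, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below assumes the Mistral-7B dimensions (32 layers, hidden size 4096, and 1024-wide key/value projections due to grouped-query attention). The key/value widths come from the published model configuration rather than from this notebook, so the result is worth checking against `model.print_trainable_parameters()`.

```python
r = 16
num_layers = 32
# (in_features, out_features) per target module; k/v widths are an assumption
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

# lora_A is (in_features x r) and lora_B is (r x out_features)
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```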

In [None]:
# Hyperparameters -- Skeleton
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    gradient_accumulation_steps=8,
    fp16=True,
    learning_rate=2e-4,
    remove_unused_columns=False,
    push_to_hub=False,
)

In [None]:
# Trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=512,
    dataset_text_field="text"
)
trainer.train()


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss


TrainOutput(global_step=3, training_loss=0.7526021003723145, metrics={'train_runtime': 30.3472, 'train_samples_per_second': 0.494, 'train_steps_per_second': 0.099, 'total_flos': 328287356190720.0, 'train_loss': 0.7526021003723145, 'epoch': 3.0})
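
The reported `global_step=3` follows from the data size and batch settings, and it also suggests that several step-based hyperparameters never take effect. The arithmetic below is a sketch that assumes the usual Trainer behavior of one optimizer step per `gradient_accumulation_steps` batches (with at least one step per epoch) and a linear warmup schedule.

```python
import math

num_examples = 5
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3
warmup_steps = 500
eval_steps = 500
logging_steps = 10
learning_rate = 2e-4

batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

# Under linear warmup the rate at step s is learning_rate * s / warmup_steps,
# so after total_steps steps it cannot exceed this value.
lr_upper_bound = learning_rate * min(total_steps, warmup_steps) / warmup_steps

print(f"optimizer steps: {total_steps}")
print(f"evaluation runs: {total_steps // eval_steps}")
print(f"logged steps:    {total_steps // logging_steps}")
print(f"learning rate is at most {lr_upper_bound:.2e}")
```

With only three optimizer steps, a 500-step warmup keeps the learning rate close to zero, and neither `eval_steps=500` nor `logging_steps=10` is reached, which is consistent with the empty Step/Training Loss/Validation Loss table. Scaling these values to the actual number of steps is worth considering once more data has been generated.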

In [None]:
# Save the model
trainer.model.save_pretrained("ESE577_chatbot")
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k

In [None]:
# Run the model locally
logging.set_verbosity(logging.CRITICAL)
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200, truncation=True)
def build_prompt(question):
  prompt = f"<s>[INST]@ESE577. {question}. [/INST]"
  return prompt

while True:
  question = input("Enter your ESE577-related question (hit Enter to exit): ").strip()
  if not question:
    break
  prompt = build_prompt(question)
  answer = pipe(prompt)
  print(answer[0]["generated_text"])
  print()


Enter your ESE577-related question (hit Enter to exit): what is deep learning?
<s>[INST]@ESE577. what is deep learning?. [/INST]1. Deep Learning is a subfield of Machine Learning, which is a type of artificial intelligence (AI) that is modeled after the human brain. It involves training artificial neural networks with multiple layers to learn and represent data in increasingly abstract and complex ways. These neural networks can learn to recognize patterns and make decisions based on large amounts of data, without being explicitly programmed to do so. Deep learning models can be used for various applications such as image and speech recognition, natural language processing, and predictive modeling. The "deep" in deep learning refers to the deep neural networks that are used, which have many layers and can learn hierarchical representations of data.
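
Note the `?.` in the echoed prompt: `build_prompt` always appends a period, even when the question already ends with punctuation. A small variant that avoids this is sketched below. It keeps the original template otherwise, and the name `build_prompt_clean` is illustrative.

```python
def build_prompt_clean(question):
    """Build the instruction prompt, adding a period only if end punctuation is missing."""
    question = question.strip()
    if not question.endswith((".", "?", "!")):
        question += "."
    return f"<s>[INST]@ESE577. {question} [/INST]"


print(build_prompt_clean("what is deep learning?"))
print(build_prompt_clean("Define overfitting"))
```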

