In [1]:
#@title bert question+context simple
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Load pre-trained model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Define context and question
context = """
Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability
to learn from data without being explicitly programmed.
"""
question = "What is machine learning?"

# Tokenize input
inputs = tokenizer(question, context, return_tensors="pt", truncation=True)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Extract logits and find answer span
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)

# Decode the answer
answer_tokens = inputs['input_ids'][0][start_idx:end_idx+1]
answer = tokenizer.decode(answer_tokens)

print("Question:", question)
print("Answer:", answer)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



Question: What is machine learning?
Answer: a field of artificial intelligence


In [2]:
%pip install gradio

Collecting gradio
  Downloading gradio-4.44.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0 (from gradio)
  Downloading fastapi-0.114.1-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.3.0 (from gradio)
  Downloading gradio_client-1.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.9 (from

In [3]:
#@title bert context linear regression gradio

import gradio as gr
import requests
from bs4 import BeautifulSoup
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Load pre-trained BERT model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

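# Scrape content from Wikipedia
def get_wikipedia_content():
    url = "https://en.wikipedia.org/wiki/Linear_regression"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract paragraphs, joined with spaces so words at paragraph boundaries do not run together
    return ' '.join(paragraph.get_text() for paragraph in soup.find_all('p'))

# BERT Question Answering
def answer_question(question):
    context = get_wikipedia_content()  # Get the Wikipedia content

    # Tokenize input. BERT accepts at most 512 tokens, so only the start of the
    # article is seen; "only_second" truncates the context and never the question.
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation="only_second", max_length=512)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract logits and find answer span
    start_idx = int(torch.argmax(outputs.start_logits))
    end_idx = int(torch.argmax(outputs.end_logits))

    # Independent argmaxes can put the end before the start, which is not a valid span
    if end_idx < start_idx:
        return "No answer found in the available context."

    # Decode the answer
    answer_tokens = inputs['input_ids'][0][start_idx:end_idx+1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    return answer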

# Define Gradio interface
interface = gr.Interface(
    fn=answer_question,
    inputs="text",  # Input for the question
    outputs="text",  # Output will be the generated answer
    title="BERT Question Answering on Wikipedia (Linear Regression)",
    description="Ask a question based on the content of the Linear Regression Wikipedia article."
)

# Launch the Gradio interface
interface.launch()




Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://600f3f04ff25965712.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
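
BERT reads at most 512 tokens per forward pass, so with `truncation=True` the cell above only ever sees the opening of the Wikipedia article. One common workaround is to split the context into overlapping windows and run the model on each. The sketch below shows only the windowing step on a plain list of token ids; `sliding_windows`, `window`, and `stride` are hypothetical names chosen for illustration.

```python
def sliding_windows(token_ids, window, stride):
    """Split token_ids into overlapping windows of at most `window` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    `window - stride` tokens and an answer near a boundary is not cut in half.
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # The last window already reaches the end of the input.
        start += stride
    return windows


chunks = sliding_windows(list(range(10)), window=4, stride=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Hugging Face tokenizers can produce such windows directly via `return_overflowing_tokens=True` together with `stride` and `max_length`; the answer would then be taken from the window with the highest span score.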




In [4]:
#@title widgets with gpt2
import ipywidgets as widgets
from IPython.display import display
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# GPT-2 requires padding token setup
tokenizer.pad_token = tokenizer.eos_token

# Create input widget for question
question_input = widgets.Text(
    value='',
    placeholder='Type your question here',
    description='Question:',
    disabled=False
)

output_label = widgets.Label(value="Answer will be shown here.")
submit_button = widgets.Button(description="Get Answer")

# Function to generate a response using GPT-2
def generate_response(button):
    question = question_input.value

    # Tokenize input with attention mask
    inputs = tokenizer(question, return_tensors="pt", padding=True)
    attention_mask = inputs['attention_mask']

    # Generate response using attention mask
    outputs = model.generate(inputs['input_ids'], attention_mask=attention_mask,
                             max_length=150, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

    # Decode and display response
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    output_label.value = f"Answer: {answer}"

# Link the button to the function
submit_button.on_click(generate_response)

# Display the widgets
display(question_input, submit_button, output_label)


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



Text(value='', description='Question:', placeholder='Type your question here')

Button(description='Get Answer', style=ButtonStyle())

Label(value='Answer will be shown here.')

In [6]:
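#@title dependencies gradio torch transformers
# IPython only applies automagic (bare `pip ...`) to single-line cells. After the
# #@title line the bare command was parsed as Python and raised
# "SyntaxError: invalid syntax (<ipython-input-6-2229c021cdae>, line 2)", so use the %pip magic.
%pip install gradio torch transformers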

In [5]:
#@title gradio with gpt2
import gradio as gr
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# GPT-2 requires padding token setup
tokenizer.pad_token = tokenizer.eos_token

# Function to generate response with tunable parameters
def generate_response(prompt, max_length, num_return_sequences, temperature, top_k):
    # Tokenize the input prompt with attention mask
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    attention_mask = inputs['attention_mask']

    # Generate the response using tunable parameters
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=attention_mask,
        max_length=int(max_length),  # Cast in case the slider delivers a float
        num_return_sequences=int(num_return_sequences),
        do_sample=True,  # Without sampling, temperature and top_k have no effect
        temperature=temperature,
        top_k=int(top_k),  # 0 disables top-k filtering
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode every returned sequence, not only the first one
    answers = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return "\n\n---\n\n".join(answers)

# Create Gradio Interface
interface = gr.Interface(
    fn=generate_response,  # The function to run
    inputs=[
        gr.Textbox(label="Input Prompt", placeholder="Type your question or prompt here"),  # Text input for prompt
        gr.Slider(50, 300, step=10, value=150, label="Max Length"),  # Slider for max_length
        gr.Slider(1, 5, step=1, value=1, label="Number of Return Sequences"),  # Slider for num_return_sequences
        gr.Slider(0.1, 1.0, step=0.1, value=0.7, label="Temperature"),  # Slider for temperature (creativity)
        gr.Slider(0, 100, step=5, value=50, label="Top K Sampling")  # Slider for top_k (sampling diversity)
    ],
    outputs="text",  # Output is the generated text
    title="GPT-2 Text Generation with Tunable Parameters",
    description="Enter a prompt and tune parameters like max_length, number of return sequences, temperature, and top_k."
)

# Launch the interface
interface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://c6532b89cfd69c16f1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
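
The Temperature and Top K sliders above reshape the next-token distribution before a token is sampled. The following is a minimal pure-Python sketch of that reshaping, using made-up logits; `top_k_temperature_probs` is a hypothetical helper, not a `transformers` function, and it only approximates what the library does internally.

```python
import math


def top_k_temperature_probs(logits, temperature=1.0, top_k=0):
    """Turn raw logits into sampling probabilities.

    Logits are divided by `temperature`, everything outside the `top_k`
    highest values is masked out (top_k=0 keeps all tokens), then a softmax
    is applied. Ties at the cutoff are all kept.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    if top_k and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
print(top_k_temperature_probs(logits, temperature=1.0))
print(top_k_temperature_probs(logits, temperature=0.5))           # sharper
print(top_k_temperature_probs(logits, temperature=1.0, top_k=2))  # last token masked
```

Lower temperatures concentrate probability on the most likely token, while top-k removes the long tail entirely. In `model.generate`, both settings only take effect when `do_sample=True` is passed.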




In [None]:
#@title exploded since memory exceeds colab free limit
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

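# Load pre-trained GPT-J model and tokenizer from Hugging Face.
# The FP32 checkpoint is 24.2 GB, more than a free Colab runtime can hold in memory,
# so this load fails there; a smaller model or a half-precision checkpoint on a GPU runtime is needed.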
model_name = "EleutherAI/gpt-j-6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-J also requires padding token setup
tokenizer.pad_token = tokenizer.eos_token

# Function to generate response with tunable parameters
def generate_response(prompt, max_length, num_return_sequences, temperature, top_k):
    # Tokenize the input prompt with attention mask
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    attention_mask = inputs['attention_mask']

    # Generate the response using tunable parameters
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=attention_mask,
        max_length=int(max_length),  # Cast in case the slider delivers a float
        num_return_sequences=int(num_return_sequences),
        do_sample=True,  # Without sampling, temperature and top_k have no effect
        temperature=temperature,
        top_k=int(top_k),  # 0 disables top-k filtering
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode every returned sequence, not only the first one
    answers = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    return "\n\n---\n\n".join(answers)

# Create Gradio Interface
interface = gr.Interface(
    fn=generate_response,  # The function to run
    inputs=[
        gr.Textbox(label="Input Prompt", placeholder="Type your question or prompt here"),  # Text input for prompt
        gr.Slider(50, 300, step=10, value=150, label="Max Length"),  # Slider for max_length
        gr.Slider(1, 5, step=1, value=1, label="Number of Return Sequences"),  # Slider for num_return_sequences
        gr.Slider(0.1, 1.0, step=0.1, value=0.7, label="Temperature"),  # Slider for temperature (creativity)
        gr.Slider(0, 100, step=5, value=50, label="Top K Sampling")  # Slider for top_k (sampling diversity)
    ],
    outputs="text",  # Output is the generated text
    title="GPT-J Text Generation with Tunable Parameters",
    description="Enter a prompt and tune parameters like max_length, number of return sequences, temperature, and top_k."
)

# Launch the interface
interface.launch()


config.json:   0%|          | 0.00/930 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/24.2G [00:00<?, ?B/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1935, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1520, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(

# Note on models, cross-checked against online sources

Training a GPT-3-level model (such as OpenAI's GPT-3 with 175 billion parameters) from scratch requires enormous computational resources, time, and money. The major resource requirements are:

### 1. **Compute Power and Cost for Training**
Training GPT-3 from scratch is estimated to take around **355 GPU-years** on V100 GPUs. At that scale, the estimated cost of a single training run is approximately **$4.6 million** on cloud infrastructure. The run consumes about **3.14 × 10²³ FLOPs** (total floating-point operations, not operations per second) and requires infrastructure that can handle both the memory and the computational demands.
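
The link between GPU-years and dollars can be checked with simple arithmetic. This sketch assumes a flat rate of $1.50 per V100 GPU-hour, which is an illustrative assumption and not a figure stated in this note; `training_cost_usd` is a hypothetical helper.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def training_cost_usd(gpu_years, usd_per_gpu_hour):
    """Convert a GPU-year estimate into a cloud bill at a flat hourly rate."""
    return gpu_years * HOURS_PER_YEAR * usd_per_gpu_hour


# 355 V100 GPU-years at an assumed $1.50 per GPU-hour.
cost = training_cost_usd(355, 1.50)
print(f"${cost:,.0f}")  # $4,664,700
```

Under that assumed rate the result lands close to the $4.6 million figure; a different hourly price scales the total proportionally.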

Models of this size also need a vast amount of memory: the 175 billion parameters alone occupy about **700 GB** in FP32 precision (4 bytes per parameter), before counting gradients, optimizer state, or activations. No single GPU holds that much, so training relies on techniques such as model parallelism to distribute the weights across many GPUs.
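
The weight-storage figure follows directly from the parameter count and the bytes used per parameter. A small sketch of that calculation, with `param_memory_gb` and the `BYTES_PER_PARAM` table as illustrative names (1 GB is taken as 10⁹ bytes):

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def param_memory_gb(num_params, dtype="fp32"):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n in [("GPT-3", 175e9), ("GPT-J", 6e9), ("GPT-2 XL", 1.5e9)]:
    print(f"{name}: {param_memory_gb(n):.0f} GB fp32, "
          f"{param_memory_gb(n, 'fp16'):.0f} GB fp16")
```

The roughly 24 GB FP32 footprint this gives for a 6-billion-parameter model matches the size of the GPT-J checkpoint download shown earlier in the notebook.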

### 2. **Fine-Tuning Resources and Cost**
Fine-tuning a pre-trained model is far more cost-effective than training from scratch. With a hosted service, the cost is billed per token:
- **Training Cost**: For OpenAI’s GPT-3.5 Turbo, the quoted prices are **$0.008 per 1,000 tokens** for training, and **$0.012 per 1,000 input tokens** and **$0.016 per 1,000 output tokens** when using the fine-tuned model. For example, fine-tuning on a 100,000-token dataset for 3 epochs bills 300,000 training tokens, or about **$2.40**.
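
The worked example can be reproduced in a few lines. The prices are the ones quoted in this note and may have changed since; `fine_tune_cost_usd` is a hypothetical helper written for illustration.

```python
def fine_tune_cost_usd(dataset_tokens, epochs, usd_per_1k_tokens):
    """Training cost when every epoch bills each dataset token once."""
    billed_tokens = dataset_tokens * epochs
    return billed_tokens / 1000 * usd_per_1k_tokens


cost = fine_tune_cost_usd(dataset_tokens=100_000, epochs=3, usd_per_1k_tokens=0.008)
print(f"${cost:.2f}")  # $2.40
```

Usage of the fine-tuned model is billed separately, at the per-token input and output prices.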
  
### 3. **Infrastructure Setup**
Training and fine-tuning at this scale require specialized hardware, such as clusters of GPUs (e.g., NVIDIA A100s) or AI accelerators like the Cerebras CS-2. Many organizations opt for managed services or cloud-based setups to reduce the complexity of operating these high-performance clusters.

If your goal is to fine-tune a GPT-3-level model rather than train one from scratch, fine-tuning offers a more accessible path, allowing you to customize the model to specific tasks without incurring the massive costs associated with full-scale training.

Training GPT-2 is significantly less resource-intensive than training GPT-3. Here's an overview of the resources required for GPT-2:

### 1. **Compute Power for GPT-2**
GPT-2 comes in four sizes, from 124 million to 1.5 billion parameters. Training the largest version (1.5 billion parameters) from scratch is estimated here at roughly **256 GPU-days** on V100 GPUs. Cost and time are far lower than for GPT-3 because the model is more than 100 times smaller.

- **Parameters**:
  - GPT-2 (Small) = 124M
  - GPT-2 (Medium) = 355M
  - GPT-2 (Large) = 774M
  - GPT-2 (XL) = 1.5B

- **Training Time**: The estimated **256 V100 GPU-days** for GPT-2 1.5B are quoted alongside a cost of roughly **$50,000–$100,000**, depending on cloud provider pricing and infrastructure.

### 2. **Memory Requirements**
GPT-2's memory requirements are far more manageable than GPT-3's. For example, GPT-2 XL (1.5B parameters) needs around **6 GB of memory** for parameter storage in FP32, which fits on a single modern data-center GPU (the V100 ships with 16 or 32 GB of VRAM, the A100 with 40 or 80 GB). Inference and light fine-tuning therefore work on one GPU without model parallelism; full training also has to hold gradients and optimizer state, which multiplies the footprint several times over.
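
Parameter storage is only part of the training footprint. A rough sketch of the per-parameter accounting follows, assuming FP32 weights, FP32 gradients, and an Adam-style optimizer that keeps two FP32 moment buffers per parameter. These are illustrative assumptions about the training setup, `training_memory_gb` is a hypothetical helper, and activations are ignored entirely.

```python
def training_memory_gb(num_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Weights + gradients + optimizer state in gigabytes (10^9 bytes).

    The defaults model FP32 training with Adam: 4 bytes of weight, 4 bytes of
    gradient, and two 4-byte moment estimates per parameter. Activation memory,
    which depends on batch size and sequence length, is not included.
    """
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return num_params * per_param / 1e9


print(f"GPT-2 XL weights only: {1.5e9 * 4 / 1e9:.0f} GB")
print(f"GPT-2 XL training state: {training_memory_gb(1.5e9):.0f} GB")
```

Under these assumptions the training state is about four times the size of the weights, which is why a model that loads comfortably for inference can still run out of memory during training.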

### 3. **Fine-Tuning GPT-2**
Fine-tuning GPT-2 is an accessible option for most organizations and individuals, since it needs far less compute and money than working with GPT-3. Depending on the task, a few thousand examples can be enough.

For fine-tuning GPT-2:
- **GPU Requirement**: Fine-tuning can often be done on a single GPU (V100 or A100).
- **Cost**: Fine-tuning typically costs a few hundred dollars depending on the size of the dataset and the number of fine-tuning steps.
  
### 4. **Inference Cost**
Inference for GPT-2 is also much cheaper. For deployment in production, GPT-2 can serve as a more cost-efficient model for generating text where the absolute scale and accuracy of GPT-3 aren’t necessary.

### Conclusion
If you need a language model with good performance but lower computational requirements and costs, GPT-2 (particularly the larger versions) can be a strong candidate. Fine-tuning GPT-2 is significantly cheaper than GPT-3, and it’s easier to run on commodity hardware.

# model comparison in table format

Here’s a comparison table summarizing key characteristics of the models mentioned:

| **Model** | **Architecture**         | **Parameters**      | **Training Cost**              | **Primary Use Case**                | **Key Characteristics**                       |
|-----------|--------------------------|---------------------|--------------------------------|-------------------------------------|-----------------------------------------------|
| **GPT-2** | Decoder-only Transformer | 1.5 billion (XL)    | ~$50k to $100k                 | Text generation, language tasks     | Lower cost, flexible, good for general NLP    |
| **GPT-3** | Decoder-only Transformer | 175 billion         | ~$4.6 million per training run | Advanced text generation, Q&A       | High performance, zero-shot/few-shot learning |
| **GPT-J** | Decoder-only Transformer | 6 billion           | Lower than GPT-3               | Text generation, general-purpose    | Open-source alternative to GPT-3              |
| **BERT**  | Encoder-only Transformer | 340 million (Large) | Lower training cost            | Extractive Q&A, understanding tasks | Best for tasks needing context understanding  |
