### 1. Detect CUDA version and install the pre-built binaries

In [None]:
import subprocess
import sys

# Check for GPU
try:
    gpu_info = subprocess.check_output(["nvidia-smi"]).decode("utf-8")
    print("GPU Detected:\n", gpu_info.split('\n')[2])
except (FileNotFoundError, subprocess.CalledProcessError) as err:
    raise RuntimeError("❌ NO GPU DETECTED! Go to Runtime > Change runtime type > T4 GPU") from err

# Install the library. Try cu121 if cu124 doesn't work
!python -m pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124 \
  --upgrade --force-reinstall --no-cache-dir

print("Installation Complete.")

✅ GPU Detected:
 | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu124
Collecting llama-cpp-python
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.16-cu124/llama_cpp_python-0.3.16-cp312-cp312-linux_x86_64.whl (551.3 MB)
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
  Downloading numpy-2.3.5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
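
The install cell hard-codes the `cu124` index and leaves the `cu121` fallback to a manual edit. A minimal sketch of choosing the index from the `nvidia-smi` banner follows. The helper names (`cuda_version`, `wheel_index_for`) and the rule "use `cu124` for CUDA 12.4 or newer, otherwise `cu121`" are assumptions for illustration, not something the notebook or llama-cpp-python prescribes.

```python
import re

BASE_URL = "https://abetlen.github.io/llama-cpp-python/whl"

def cuda_version(smi_output):
    """Return (major, minor) parsed from the nvidia-smi banner, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def wheel_index_for(smi_output):
    """Hypothetical helper: pick cu124 for CUDA >= 12.4, else cu121."""
    version = cuda_version(smi_output)
    if version is None:
        raise RuntimeError("Could not find a CUDA version in the nvidia-smi output")
    tag = "cu124" if version >= (12, 4) else "cu121"
    return f"{BASE_URL}/{tag}"

# Banner line copied from the cell output above.
banner = "| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |"
print(wheel_index_for(banner))
```

The returned URL could then be passed to `pip install --extra-index-url` in place of the hard-coded one.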

### 2. Import libraries

In [None]:
import gc
import ctypes
import torch
from llama_cpp import Llama

### Free GPU memory method

In [None]:
def free_memory(model_instance):
    # Close the model connection
    if hasattr(model_instance, 'close'):
        model_instance.close()
        print("Model connection closed.")

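    # Drop this function's reference to the model. The caller's own variable
    # (e.g. `llm`) still keeps the Python object alive until it is deleted too.
    del model_instance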
    print("Python object deleted.")

    # Force Python's Garbage Collector to clear the object
    gc.collect()

    # Ask glibc to return freed heap pages to the OS (Linux only).
    # This trims host RAM that llama.cpp allocated; it does not touch GPU memory.
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)
        print("C++ memory freed")
    except Exception as e:
        print(f"Could not free C++ memory: {e}")

    print("\n GPU Memory should now be freed.")
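
One detail of `free_memory` is worth illustrating: `del model_instance` removes only the function's local name, so the object survives for as long as the caller still holds its own variable. The sketch below shows this in plain Python with `weakref`. `FakeModel` and `drop_local` are hypothetical stand-ins, and the behaviour shown assumes CPython's reference counting.

```python
import gc
import weakref

class FakeModel:
    """Hypothetical stand-in for a loaded model object."""

def drop_local(model_instance):
    # Removes only this function's local name; the caller's reference remains.
    del model_instance
    gc.collect()

llm = FakeModel()
alive = weakref.ref(llm)  # lets us observe the object without keeping it alive

drop_local(llm)
survived_call = alive() is not None  # the caller still holds `llm`
print(survived_call)

del llm  # drop the caller's reference as well
gc.collect()
collected = alive() is None  # no references are left
print(collected)
```

In the notebook this suggests following `free_memory(llm)` with `del llm` if the Python wrapper itself should also be released.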

### For best results, ask your question in English!

# 🚀 Local Inference on GPU

This notebook runs the **BigJuicyData/Anni-Q4_K_M-GGUF** model locally on your own GPU instance (a T4 GPU in Google Colab).

</br>

---
</br>

### How to Use
1. **Edit the Prompt:** Scroll down to the code cell and find the variable `PROMPT = "..."`. Replace the text inside with your specific coding question.
2. **Run the Cell:** Press the Play button (▶️) or `Shift + Enter`.
3. **Wait for Stream:** The model will "think" first (showing logic), followed by the code.
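
Step 3 can be pictured with a small sketch of how streamed chunks are assembled into text. The chunk dictionaries below are hand-written stand-ins that mirror the `choices[0]['delta']` shape read by the notebook's print loop. They are not real model output, and `collect_stream` is a hypothetical helper name.

```python
def collect_stream(chunks):
    """Join the `content` pieces of streamed chat-completion chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        # Some chunks carry no text (for example a role-only or empty delta).
        piece = delta.get("content")
        if piece:
            parts.append(piece)
    return "".join(parts)

# Hand-written chunks for illustration only.
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "def solution():"}}]},
    {"choices": [{"delta": {"content": "\n    pass"}}]},
    {"choices": [{"delta": {}}]},
]
text = collect_stream(fake_chunks)
print(text)
```

Collecting the text this way, rather than only printing it, keeps the full response available for later processing.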


</br>

---

</br>

### Where is the output?
The output streams at the **very bottom** of the cell output area. Look for a block formatted like this:

```python
# The output will look like this and is wrapped inside
# ```python
def solution():
    pass
# ```
```
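
Because the answer arrives wrapped in a `python` code fence, the code can also be pulled out programmatically once the full response text has been collected. A minimal sketch using a regular expression follows. `extract_python_blocks` is a hypothetical helper, and the sample `response` string is made up for illustration.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block

def extract_python_blocks(response):
    """Return the bodies of all python-fenced blocks found in `response`."""
    pattern = re.compile(
        re.escape(FENCE) + r"python[ \t]*\n(.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    return [body.strip() for body in pattern.findall(response)]

# Made-up response text in the format described above.
response = (
    "Reasoning about the problem...\n"
    f"{FENCE}python\n"
    "def solution():\n"
    "    pass\n"
    f"{FENCE}\n"
)
for block in extract_python_blocks(response):
    print(block)
```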

</br>

---

</br>

#### **IF YOU ENCOUNTER AN ERROR, do this!** Go to Runtime > Restart session and run all
- Restart the session to clear the GPU RAM!

</br>

---

Model pages:
1. [GGUF](https://huggingface.co/BigJuicyData/coder-final-Q4_K_M-GGUF)
2. **vLLM**: [Hugging Face](https://huggingface.co/BigJuicyData/coder-final), [ModelScope](https://modelscope.cn/models/quanteat/coder-final)
3. [MLX](https://huggingface.co/BigJuicyData/coder-final-mlx-4Bit)

In [None]:
### IF YOU ENCOUNTER AN ERROR, do this! Go to Runtime > Restart session and run all

REPO_ID = "BigJuicyData/Anni-Q4_K_M-GGUF"
FILENAME = "coder-final-q4_k_m.gguf"
CONTEXT_SIZE = 24000

# --------------------------------------------
# REPLACE THIS WITH YOUR PROGRAMMING QUESTION!
# For best results, write the question in English.
# --------------------------------------------
PROMPT="""
Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.

You can return the answer in any order.



Example 1:

Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].

Example 2:

Input: nums = [3,2,4], target = 6
Output: [1,2]

Example 3:

Input: nums = [3,3], target = 6
Output: [0,1]



Constraints:

    2 <= nums.length <= 10^4
    -10^9 <= nums[i] <= 10^9
    -10^9 <= target <= 10^9
    Only one valid answer exists.


Follow-up: Can you come up with an algorithm that is less than O(n^2) time complexity?
"""

# Initialize the model
llm = Llama.from_pretrained(
    repo_id=REPO_ID,
    filename=FILENAME,
    n_ctx=CONTEXT_SIZE,
    n_gpu_layers=-1,
    verbose=True
)

# Define the messages (Chat Format)
messages = [
    {
        "role": "system",
        "content": """You are Qwen. You are an expert algorithmic assistant.
1. CORE OBJECTIVE: Solve the user's problem using the most efficient algorithm (optimal Time/Space Complexity).
2. THOUGHT PROCESS: Use your <think> tag to analyze edge cases and plan the logic.
3. OUTPUT: After reasoning, provide the Python solution."""
    },
    {
        "role": "user",
        "content": PROMPT,
    }
]

# Run Inference
print(f"\n Generating with {CONTEXT_SIZE} token limit...\n")

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=CONTEXT_SIZE,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    stream=True
)

# Print the streamed response
for chunk in output:
    delta = chunk['choices'][0]['delta']
    if 'content' in delta:
        print(delta['content'], end="", flush=True)

# Info
print("\nYou can copy the code above directly. It is wrapped in ```python ``` fences.")

# Free the memory
print("\n-------- Freeing memory --------")
free_memory(llm)
print("-------- Allocated GPU memory should be empty now --------")

llama_model_load_from_file_impl: using device CUDA0 (Tesla T4) - 14980 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 443 tensors from /root/.cache/huggingface/hub/models--BigJuicyData--coder-final-Q4_K_M-GGUF/snapshots/ee6cc085a771a1641d29006cb7f721cc1f70c98d/./coder-final-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Coder Final
llama_model_loader: - kv   3:                         general.size_label str              = 15B
llama_model_loader: - kv   4:                          qwen3.block_count u32              = 40
llama_model_loader: - kv   5:                       qwen3.context_leng


 Generating with 24000 token limit...

Okay, I need to solve this two-sum problem. Let's think about the examples given. The task is to find two numbers in the array that add up to the target and return their indices. The constraints say that each input has exactly one solution, and I can't use the same element twice. Oh right, and the follow-up asks for an algorithm better than O(n²), so I need to think about efficient methods.

The brute force approach would be to check every pair of elements, but that's O(n²) time, which is probably too slow for large arrays. The user wants something better. So, what's a more efficient way?

Hmm, using a hash map (dictionary in Python) could work. The idea is to store the values we've seen so far along with their indices. For each number in the array, we check if the complement (target - current number) exists in the hash map. If it does, we return the current index and the index from the hash map. If not, we add the current number to the hash ma

llama_perf_context_print:        load time =     715.25 ms
llama_perf_context_print: prompt eval time =     714.72 ms /   322 tokens (    2.22 ms per token,   450.52 tokens per second)
llama_perf_context_print:        eval time =   58473.03 ms /   942 runs   (   62.07 ms per token,    16.11 tokens per second)
llama_perf_context_print:       total time =   61956.25 ms /  1264 tokens
llama_perf_context_print:    graphs reused =        912



 -------- Freeing memory --------
Model connection closed.
Python object deleted.
C++ memory freed

 GPU Memory should now be freed.
-------- Allocated GPU memory should be empty now --------
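
The final message says the memory "should" be empty. One way to check is to ask `nvidia-smi` directly. The sketch below assumes the installed `nvidia-smi` supports the `--query-gpu` and `--format=csv,noheader,nounits` options. The function names are hypothetical, and the sample values are illustrative rather than measured.

```python
import subprocess

def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]
    ).decode("utf-8")
    return parse_gpu_memory(out)

# Sample text in the assumed output format (values are illustrative).
print(parse_gpu_memory("3, 15360\n"))
```

Calling `query_gpu_memory()` before loading the model and again after `free_memory(llm)` gives a before/after comparison.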
