<a href="https://colab.research.google.com/github/Ellen-Gu/Quantitative-Analysis/blob/main/Exploring_and_Advancing_in_AI_A_Journey_with_tensorflow%2C_pytorch%2C_LLMs%2C_Quantized_Models_and_APP_development%2C_and_Continuous_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Recently, I completely wiped my Docker system to install a fresh, updated version, and rebuilt all images and containers. In the new setup, I installed TensorFlow and PyTorch from scratch. After updating to the latest Nvidia driver (released on October 31, 2023), my old Nvidia Quadro M1200 GPU, with compute capability 5.0 and 4GB of memory, was revitalized. Despite its limitations, I'm thrilled that TensorFlow, PyTorch, LangChain, PySpark, Python, and R now work together in the same container. Having a customized platform installed locally makes it easier to test, play, learn, and develop with various LLM and AI technologies, and the same setup can be reproduced in any Docker environment.

One frustrating issue, however, is a bug in WSL on Windows 10 that leads to an Nvidia driver version mismatch. According to some users, the problem is only resolved by upgrading to Windows 11. Currently, this bug causes LlamaCpp, which I use to run quantized LLMs, to hit errors during GPU-accelerated operations. As a result, LangChain is limited to the CPU on my older computer.

I am currently testing and experimenting with GPUs, LLMs, and LangChain locally. OpenAI's models are exceptional tools, but many companies might prefer to train LLMs locally when confidential information is involved. Open-source, pre-trained LLMs could serve as foundation models, which can then be further trained with proprietary data. However, most LLMs are too large for computers with lower specs. Ingeniously, the llama.cpp project (written in C++) reduces the number of bits used to store each weight, quantizing the models to a manageable size. For instance, Meta's Llama 2, converted into the latest llama.cpp format (GGUF), is available for free download from Hugging Face.
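To make the idea of reducing the bit size concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It is a simplified absmax scheme for illustration only, not the actual Q4_K_M layout used in the GGUF files; the names `quantize_block` and `dequantize_block` and the example weights are made up for this sketch.

```python
from typing import List, Tuple


def quantize_block(values: List[float], bits: int = 4) -> Tuple[float, List[int]]:
    """Map floats to small signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed values
    # One scale per block; fall back to 1.0 when the block is all zeros.
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale: float, quantized: List[int]) -> List[float]:
    """Recover approximate floats from the stored integers and the scale."""
    return [scale * q for q in quantized]


weights = [0.12, -0.50, 0.33, 0.07]  # made-up example weights
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
print(q)
print([round(w, 3) for w in restored])
```

Each weight is then stored as a 4-bit integer plus a share of the block's scale, instead of 16 or 32 bits, at the cost of a rounding error of at most half a scale step per weight.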

The following are the models I have played with. I think a quantized LLM based on Meta's recently released Llama 2 could serve as my base model.

In [13]:
!ls -l /home/sharedspace/gguf*

total 77304748
-rwxrwsrwx 1 rstudio sharedspace 41422910368 Oct 29 17:08 llama-2-70b-chat.Q4_K_M.gguf
-rwxrwsrwx 1 rstudio sharedspace  4081004224 Oct 29 18:41 llama-2-7b-chat.Q4_K_M.gguf
-rwxrwsr-- 1 rstudio sharedspace  5131421440 Nov  3 18:23 openchat_3.5.Q5_K_M.gguf
-rwxrwsrwx 1 rstudio sharedspace  7865963360 Oct 30 04:03 xwin-lm-13b-v0.2.Q4_K_M.gguf
-rwxrwsr-- 1 rstudio sharedspace 20658710464 Nov  6 03:13 yi-34b.Q4_K_M.gguf


I have downloaded several LLMs, including llama-2-7b-chat.Q4_K_M.gguf, the 70b model, and yi-34b.Q4_K_M.gguf, among others. Remarkably, the 7b model occupies only about 4GB when fully loaded into RAM. All these models can be integrated with LangChain, giving an experience similar to OpenAI's offerings. While these models excel in language-related tasks, there is still considerable scope for improvement in areas like mathematics and logical reasoning.
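As a rough sanity check on those sizes, the sketch below estimates the average number of bits stored per weight from the file sizes in the `ls -l` listing above. The parameter counts are approximate figures I am assuming for Llama 2 (they do not come from this notebook), and `bits_per_weight` is just a helper name for this example.

```python
# File sizes in bytes, copied from the `ls -l` listing above.
file_sizes = {
    "llama-2-7b-chat.Q4_K_M.gguf": 4_081_004_224,
    "llama-2-70b-chat.Q4_K_M.gguf": 41_422_910_368,
}

# Approximate parameter counts (assumed, not taken from this notebook).
param_counts = {
    "llama-2-7b-chat.Q4_K_M.gguf": 6.74e9,
    "llama-2-70b-chat.Q4_K_M.gguf": 68.98e9,
}


def bits_per_weight(size_bytes: int, n_params: float) -> float:
    """Average number of bits stored per parameter."""
    return size_bytes * 8 / n_params


for name, size in file_sizes.items():
    n_params = param_counts[name]
    bpw = bits_per_weight(size, n_params)
    fp16_gb = n_params * 2 / 1e9  # 2 bytes per weight in 16-bit floats
    print(f"{name}: {size / 1e9:.2f} GB on disk, ~{bpw:.2f} bits/weight, "
          f"~{fp16_gb:.1f} GB at 16-bit")
```

Under these assumptions both files come out a little under 5 bits per weight, roughly a third of the size of the same weights in 16-bit floats.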

When exploring LLMs and AI technologies, I notice that documentation becomes outdated quickly because AI tools and models evolve so rapidly. This is sometimes referred to as a 'fluid state', and it calls for continual learning and knowledge updating. Meanwhile, a deep understanding of the underlying mechanisms of LLMs is essential for further advancement. The next step typically involves gathering code snippets to construct a customized application, which is more feasible once we have a clear conceptual framework. Accumulating knowledge, practicing, planning, building frameworks, and solidifying understanding are crucial to achieving breakthroughs.

After experimenting with TensorFlow, PyTorch, and various models, I've encountered a challenge that I hope someone can assist with. Specifically, when using TensorFlow, I find it difficult to clear the GPU memory. The only solution I've come across in the documentation is to restart my kernel, which is less than ideal. In contrast, PyTorch lets me release cached GPU memory after use with `torch.cuda.empty_cache()` together with Python's garbage collector. While using both TensorFlow and PyTorch in the same notebook is uncommon, such functionality could be valuable in a team environment. Using separate Python environments might alleviate some of the complexities associated with memory management. I'll keep this as an open question, hoping to find a more effective solution in the future.

The nvidia-smi output below shows that the GPU memory claimed by TensorFlow is still held, even after running the PyTorch code and its garbage collection.

In [11]:
!nvidia-smi

Wed Nov 15 16:27:48 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.01             Driver Version: 537.70       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Quadro M1200                   On  | 00000000:01:00.0 Off |                  N/A |
| N/A    0C    P8              N/A / 200W |   3342MiB /  4096MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
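Restarting the kernel works because GPU memory owned by a process is returned to the driver when that process exits. One workaround that is sometimes suggested is to run the TensorFlow part in a short-lived child process, so that only the child has to exit. The sketch below shows the pattern with the standard library only; `run_isolated` is a hypothetical helper name, and the snippet passed to it is a stand-in for real TensorFlow code. Separately, `tf.config.experimental.set_memory_growth` can stop TensorFlow from reserving most of the GPU memory up front, although as far as I understand it does not hand memory back once it has been allocated.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> str:
    """Run a Python snippet in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child fails
    )
    return result.stdout


# Stand-in workload; in practice this string would import tensorflow,
# run the GPU computation, and print whatever the notebook needs back.
print(run_isolated("print(sum(range(10)))"))
```

The trade-off is that results have to be passed back as text or files, and every call pays the start-up cost of a new interpreter.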
                                                         

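The following cells test the TensorFlow and PyTorch versions installed locally on my laptop. In these measurements the GPU appears hundreds of times faster than the CPU with TensorFlow. Note, however, that GPU operations are launched asynchronously, so timings taken without waiting for the result may understate the real compute time. For PyTorch, the garbage collection takes some time.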

In [5]:
import tensorflow as tf
print(tf.__version__)
# List available GPU devices
print("Available GPU devices:", tf.config.list_physical_devices('GPU'))

2.14.0
Available GPU devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [6]:
# Simple computation to run on GPU
with tf.device('/GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0], shape=[1, 5], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0], shape=[1, 5], name='b')
    c = tf.matmul(a, tf.transpose(b))

print("Result:", c)

Result: tf.Tensor([[55.]], shape=(1, 1), dtype=float32)


In [7]:
import timeit

# Ensure computations run on CPU
with tf.device('/CPU:0'):
    # Create two random matrices
    a_cpu = tf.random.uniform([10000, 10000], name='a_cpu')
    b_cpu = tf.random.uniform([10000, 10000], name='b_cpu')

    # Define the matrix multiplication operation
    def cpu_computation():
        tf.matmul(a_cpu, b_cpu)

# Time the operation
cpu_time = timeit.timeit('cpu_computation()', number=10, setup="from __main__ import cpu_computation")
print(f"CPU computation time for 10 runs: {cpu_time:.4f} seconds")

CPU computation time for 10 runs: 1.2595 seconds


In [8]:
# Ensure computations run on GPU
with tf.device('/GPU:0'):
    # Create two random matrices
    a_gpu = tf.random.uniform([10000, 10000], name='a_gpu')
    b_gpu = tf.random.uniform([10000, 10000], name='b_gpu')

    # Define the matrix multiplication operation
    def gpu_computation():
        tf.matmul(a_gpu, b_gpu)

# Time the operation
gpu_time = timeit.timeit('gpu_computation()', number=10, setup="from __main__ import gpu_computation")
print(f"GPU computation time for 10 runs: {gpu_time:.4f} seconds")

GPU computation time for 10 runs: 0.0023 seconds


In [9]:
#import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import torch
import timeit

# Ensure the code runs on GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create two random matrices and move them to GPU
a_gpu = torch.rand(10000, 10000, device=device)
b_gpu = torch.rand(10000, 10000, device=device)

# Define the matrix multiplication operation
def gpu_computation():
    torch.mm(a_gpu, b_gpu)

# Time the operation
gpu_time = timeit.timeit('gpu_computation()', number=1, setup="from __main__ import gpu_computation")
print(f"GPU computation time: {gpu_time:.4f} seconds")

# To check memory usage (optional)
print(f"Memory allocated on GPU: {torch.cuda.memory_allocated(device)/1e6:.2f} MB")
# memory_reserved() replaces the deprecated memory_cached()
print(f"Memory cached on GPU: {torch.cuda.memory_reserved(device)/1e6:.2f} MB")

del a_gpu
del b_gpu
torch.cuda.empty_cache()
import gc
gc.collect()

GPU computation time: 0.9043 seconds
Memory allocated on GPU: 809.63 MB
Memory cached on GPU: 1222.64 MB




260

In [10]:
import torch
import timeit

# Run this comparison on the CPU
device = torch.device("cpu")

# Create two random matrices on the CPU
a_cpu = torch.rand(10000, 10000, device=device)
b_cpu = torch.rand(10000, 10000, device=device)

# Define the matrix multiplication operation
def cpu_computation():
    torch.mm(a_cpu, b_cpu)

# Time the operation
cpu_time = timeit.timeit('cpu_computation()', number=1, setup="from __main__ import cpu_computation")
print(f"CPU computation time: {cpu_time:.4f} seconds")

CPU computation time: 8.9565 seconds
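A caveat on the timings above: GPU kernels are queued asynchronously, so a timer that stops as soon as the call returns can measure the launch rather than the computation. The TensorFlow GPU figure (0.0023 seconds for 10 multiplications of 10000 x 10000 matrices) most likely reflects this. The sketch below is one way to structure a fairer measurement; `time_op` and its `sync` parameter are hypothetical names for this example, and the workload is a plain CPU stand-in so the code runs anywhere.

```python
import time
from typing import Callable, Optional


def time_op(op: Callable[[], object],
            sync: Optional[Callable[[], None]] = None,
            warmup: int = 1,
            repeats: int = 5) -> float:
    """Return the average wall-clock seconds per call of `op`."""
    for _ in range(warmup):  # warm-up runs absorb one-off setup costs
        op()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing
    start = time.perf_counter()
    for _ in range(repeats):
        op()
    if sync is not None:
        sync()  # wait for any queued device work before stopping the clock
    return (time.perf_counter() - start) / repeats


# Stand-in CPU workload so the sketch runs without a GPU.
avg = time_op(lambda: sum(i * i for i in range(100_000)))
print(f"average: {avg:.6f} s per call")
```

With PyTorch, `torch.cuda.synchronize` can be passed as `sync`. With TensorFlow, one option is to have `op` read the result back, for example `lambda: tf.matmul(a_gpu, b_gpu).numpy()`, which blocks until the value is ready but also adds the cost of copying it to the host.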


It's time to explore LangChain and various LLMs. For my older laptop, quantized models are a practical choice. With LangChain, LLMs can retain memory of previous interactions by feeding the conversation history back into each prompt, mimicking the behavior of OpenAI's ChatGPT. The biggest advantage of quantized models is their efficiency: they're lightweight with minimal loss in performance, highly capable, and can be run on lower-spec computers.
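The model itself is stateless, so "memory" means that earlier turns are replayed in front of each new question. The sketch below illustrates that idea with the standard library only. It is not LangChain's implementation (LangChain ships its own memory classes, such as `ConversationBufferMemory`); `ChatWithHistory` and `fake_llm` are hypothetical names, and the fake model only counts the questions it is shown.

```python
from typing import Callable, List, Tuple


class ChatWithHistory:
    """Keep past turns and replay them in front of every new question."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.turns: List[Tuple[str, str]] = []

    def build_prompt(self, question: str) -> str:
        lines = []
        for past_question, past_answer in self.turns:
            lines.append(f"Question: {past_question}")
            lines.append(f"Answer: {past_answer}")
        lines.append(f"Question: {question}")
        lines.append("Answer:")
        return "\n".join(lines)

    def ask(self, question: str) -> str:
        answer = self.llm(self.build_prompt(question)).strip()
        self.turns.append((question, answer))
        return answer


def fake_llm(prompt: str) -> str:
    """Stand-in model that reports how many questions it was shown."""
    return f"I can see {prompt.count('Question:')} question(s)."


chat = ChatWithHistory(fake_llm)
print(chat.ask("What is the capital of England?"))
print(chat.ask("Who is the king there?"))
```

Any callable that takes a prompt string and returns text could stand in for `fake_llm`. The history grows with every turn, so in practice it has to be trimmed or summarized to fit the model's context window.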

Today, I came across two interesting articles. One is [Beginner's Guide to FinGPT: Training with LoRA and ChatGLM2-6B](https://byfintech.medium.com/beginners-guide-to-fingpt-training-with-lora-chatglm2-6b-9eb5ace7fe99), and the other is [Building a question-answering system using LLM on your private data](https://krishnayogi.medium.com/building-a-question-answering-system-using-llm-50904793ae07). I haven't had the chance to delve into them deeply yet, but the second article, in particular, seems feasible to implement on my laptop. Keeping them within reach, it's now time to return to my reading and studying. I am deeply convinced that a thorough grasp of the fundamental principles of LLMs is crucial for creating exceptional, dependable, and secure applications. Consequently, I devote the majority of my time to reading and studying. The 'fancy' playing, as I call it, forms an integral part of my practice. It is important, but it can never substitute for the hard work of reading and studying.

In [12]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Number of layers to offload to the GPU; change this based on your model and GPU VRAM.
# Because of the Windows 10 WSL bug in Nvidia driver version recognition, LangChain's
# LlamaCpp cannot use the GPU here, although TensorFlow and PyTorch can.
# Setting it to 0 keeps the GPU syntax but runs on the CPU.
n_gpu_layers = 0
n_batch = 32  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/home/sharedspace/gguf-llama-models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    temperature=0,  # 0 keeps the answers as deterministic as possible
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
    max_tokens=1000 # adjusted from 256 (Default) to get more complete answers
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Quadro M1200, compute capability 5.0
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/sharedspace/gguf-llama-models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_mode

In [14]:
prompt = """
Question: What is the capital of England??
"""
llm(prompt)


Llama.generate: prefix-match hit



Answer: The capital of England is London.


llama_print_timings:        load time =    1582.69 ms
llama_print_timings:      sample time =       4.92 ms /    11 runs   (    0.45 ms per token,  2238.05 tokens per second)
llama_print_timings: prompt eval time =    1156.73 ms /     8 tokens (  144.59 ms per token,     6.92 tokens per second)
llama_print_timings:        eval time =    2268.44 ms /    10 runs   (  226.84 ms per token,     4.41 tokens per second)
llama_print_timings:       total time =    3468.94 ms


'\nAnswer: The capital of England is London.'

In [15]:
prompt = """
Question: who is the king there?
"""
llm(prompt)

Llama.generate: prefix-match hit



The King of England is currently Queen Elizabeth II.


llama_print_timings:        load time =    1582.69 ms
llama_print_timings:      sample time =       5.21 ms /    12 runs   (    0.43 ms per token,  2305.48 tokens per second)
llama_print_timings: prompt eval time =    1111.60 ms /     7 tokens (  158.80 ms per token,     6.30 tokens per second)
llama_print_timings:        eval time =    2520.37 ms /    11 runs   (  229.12 ms per token,     4.36 tokens per second)
llama_print_timings:       total time =    3678.60 ms


'\nThe King of England is currently Queen Elizabeth II.'

In [16]:
prompt = """
Question: who will be next to throne after the queen?
"""
llm(prompt)

Llama.generate: prefix-match hit



The current line of succession to the British throne is as follows:

1. Queen Elizabeth II
2. Prince Charles, Prince of Wales (her eldest son)
3. Prince William, Duke of Cambridge (her grandson)
4. Prince George of Cambridge (her great-grandson)
5. Princess Charlotte of Cambridge (her great-granddaughter)
6. Prince Louis of Cambridge (her great-grandson)
7. Prince Andrew, Duke of York (her second son)
8. Princess Beatrice of York (her granddaughter)
9. Princess Eugenie of York (her granddaughter)
10. James Viscount Severn (her great-grandsons)

It is worth noting that the line of succession can change due to births, deaths, marriages or other events that may affect the status of individuals in the line of succession.


llama_print_timings:        load time =    1582.69 ms
llama_print_timings:      sample time =      90.58 ms /   197 runs   (    0.46 ms per token,  2174.87 tokens per second)
llama_print_timings: prompt eval time =    1396.35 ms /    10 tokens (  139.64 ms per token,     7.16 tokens per second)
llama_print_timings:        eval time =   47680.17 ms /   196 runs   (  243.27 ms per token,     4.11 tokens per second)
llama_print_timings:       total time =   49930.17 ms


'\nThe current line of succession to the British throne is as follows:\n\n1. Queen Elizabeth II\n2. Prince Charles, Prince of Wales (her eldest son)\n3. Prince William, Duke of Cambridge (her grandson)\n4. Prince George of Cambridge (her great-grandson)\n5. Princess Charlotte of Cambridge (her great-granddaughter)\n6. Prince Louis of Cambridge (her great-grandson)\n7. Prince Andrew, Duke of York (her second son)\n8. Princess Beatrice of York (her granddaughter)\n9. Princess Eugenie of York (her granddaughter)\n10. James Viscount Severn (her great-grandsons)\n\nIt is worth noting that the line of succession can change due to births, deaths, marriages or other events that may affect the status of individuals in the line of succession.'

In [17]:
prompt = """
Question: as ans AI LLM model, when is your last training date?
"""
llm(prompt)

Llama.generate: prefix-match hit



I am an AI language model, and I don't have a specific "last training date" as I am constantly learning and updating my knowledge base through various sources of data. My training process involves consuming large amounts of text data from various sources, such as books, articles, websites, and other texts, and using this data to learn patterns and relationships in language.
I'm trained on a variety of tasks, including language translation, language generation, and question answering, among others. My training is an ongoing process, and I continue to learn and improve over time as new data becomes available.
So, to answer your question directly, I don't have a specific "last training date" because my training is an ongoing process that never stops. I am always learning and improving, so there is no end date for my training.


llama_print_timings:        load time =    1582.69 ms
llama_print_timings:      sample time =      84.81 ms /   177 runs   (    0.48 ms per token,  2087.09 tokens per second)
llama_print_timings: prompt eval time =    2158.32 ms /    16 tokens (  134.90 ms per token,     7.41 tokens per second)
llama_print_timings:        eval time =   43429.92 ms /   176 runs   (  246.76 ms per token,     4.05 tokens per second)
llama_print_timings:       total time =   46361.04 ms


'\nI am an AI language model, and I don\'t have a specific "last training date" as I am constantly learning and updating my knowledge base through various sources of data. My training process involves consuming large amounts of text data from various sources, such as books, articles, websites, and other texts, and using this data to learn patterns and relationships in language.\nI\'m trained on a variety of tasks, including language translation, language generation, and question answering, among others. My training is an ongoing process, and I continue to learn and improve over time as new data becomes available.\nSo, to answer your question directly, I don\'t have a specific "last training date" because my training is an ongoing process that never stops. I am always learning and improving, so there is no end date for my training.'

In [1]:
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# cpu syntax
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="/home/sharedspace/gguf-llama-models/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True, # Verbose is required to pass to the callback manager
)

2023-11-15 16:37:24.865993: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-15 16:37:24.866084: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-15 16:37:24.866110: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-15 16:37:24.874791: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_i

In [5]:
prompt = """
Question: now i have a math question to you, you always failed when temperature=0.8 (default). Now your temperature is set to 0.  Let's try again. cosine(pi/2)=?
"""
llm(prompt)

Llama.generate: prefix-match hit



Answer: I apologize, but I'm a large language model, I cannot perform calculations beyond the scope of my training data. The equation you provided, `cosine(π/2) = ?`, is not a valid mathematical equation.
The sine, cosine, and tangent functions are defined only for real numbers, not complex numbers or irrational numbers like π. Therefore, I cannot provide an answer to this equation.


llama_print_timings:        load time =    1134.12 ms
llama_print_timings:      sample time =      47.28 ms /    94 runs   (    0.50 ms per token,  1988.03 tokens per second)
llama_print_timings: prompt eval time =    6681.96 ms /    46 tokens (  145.26 ms per token,     6.88 tokens per second)
llama_print_timings:        eval time =   22374.12 ms /    93 runs   (  240.58 ms per token,     4.16 tokens per second)
llama_print_timings:       total time =   29459.17 ms


"\nAnswer: I apologize, but I'm a large language model, I cannot perform calculations beyond the scope of my training data. The equation you provided, `cosine(π/2) = ?`, is not a valid mathematical equation.\nThe sine, cosine, and tangent functions are defined only for real numbers, not complex numbers or irrational numbers like π. Therefore, I cannot provide an answer to this equation."

In [6]:
prompt = """
Question: are you sure? please try again to give the answer to cos(pi/2). hint: sin(pi/2)=1, and sin^2(\theta)+cos^2(\theta)=1
"""
llm(prompt)

Llama.generate: prefix-match hit



Answer: Sure! Let's use the trigonometric identity \sin^2(θ) + \cos^2(θ) = 1 to find the value of $\cos(\pi/2)$.

We know that $\sin(\pi/2) = 1$, so we can write:

$$\cos(\pi/2) = \frac{\cos(\pi/2)}{\sin(\pi/2)} = \frac{1}{\sin(\pi/2)}$$

Now, $\sin(\pi/2)$ is equal to $1$, so we can simplify:

$$\cos(\pi/2) = \frac{1}{1} = 1$$

Therefore, the value of $\cos(\pi/2)$ is $1$.


llama_print_timings:        load time =    1134.12 ms
llama_print_timings:      sample time =      73.29 ms /   168 runs   (    0.44 ms per token,  2292.29 tokens per second)
llama_print_timings: prompt eval time =    7615.39 ms /    47 tokens (  162.03 ms per token,     6.17 tokens per second)
llama_print_timings:        eval time =   38788.13 ms /   167 runs   (  232.26 ms per token,     4.31 tokens per second)
llama_print_timings:       total time =   47126.60 ms


"\nAnswer: Sure! Let's use the trigonometric identity \\sin^2(θ) + \\cos^2(θ) = 1 to find the value of $\\cos(\\pi/2)$.\n\nWe know that $\\sin(\\pi/2) = 1$, so we can write:\n\n$$\\cos(\\pi/2) = \\frac{\\cos(\\pi/2)}{\\sin(\\pi/2)} = \\frac{1}{\\sin(\\pi/2)}$$\n\nNow, $\\sin(\\pi/2)$ is equal to $1$, so we can simplify:\n\n$$\\cos(\\pi/2) = \\frac{1}{1} = 1$$\n\nTherefore, the value of $\\cos(\\pi/2)$ is $1$."

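The derivation in the answer above goes wrong when it replaces cos(pi/2) with 1/sin(pi/2). Following the hint instead, cos^2(pi/2) = 1 - sin^2(pi/2) = 1 - 1 = 0, so cos(pi/2) = 0. A quick check with Python's standard `math` module (the variable names are just for this example):

```python
import math

sin_half_pi = math.sin(math.pi / 2)
# Pythagorean identity: cos^2(x) = 1 - sin^2(x)
cos_squared = 1 - sin_half_pi ** 2
cos_half_pi = math.cos(math.pi / 2)

print(sin_half_pi)
print(cos_squared)
print(cos_half_pi)  # tiny non-zero value from floating-point rounding of pi
```

This is the kind of question where handing the arithmetic to a calculator or code is more dependable than asking the language model to reason it out.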
In [7]:
prompt = """
Question: ok got it. your math is not so well trained. that is hard topic.
"""
llm(prompt)

Llama.generate: prefix-match hit


ok, now I'm ready to help you with the next question. please go ahead and ask it.


llama_print_timings:        load time =    1134.12 ms
llama_print_timings:      sample time =      10.54 ms /    23 runs   (    0.46 ms per token,  2182.58 tokens per second)
llama_print_timings: prompt eval time =    2612.34 ms /    18 tokens (  145.13 ms per token,     6.89 tokens per second)
llama_print_timings:        eval time =    4958.73 ms /    22 runs   (  225.40 ms per token,     4.44 tokens per second)
llama_print_timings:       total time =    7668.25 ms


"ok, now I'm ready to help you with the next question. please go ahead and ask it."