# Large Language Models

Large language models (LLMs) have changed the way we engage with and analyse language. Trained on massive repositories of text data, they can comprehend, generate, and translate human language with remarkable accuracy and fluency. In this notebook we run one of these models using LangChain, a framework for building LLM applications that stands out for its efficiency and flexibility.

This notebook is designed to run both locally and on Google Colab. If you only have a CPU, there are clear instructions on how to run the notebook without a GPU. Simply follow the instructions for either GPU or CPU, depending on your setup.

Please note that using only a CPU will result in noticeably slower model performance.

---
## 1.&nbsp; Installations and Settings 🛠️

On Google Colab, you have access to free GPUs, whenever they're available. Let's utilise this advantage. To configure a Colab GPU, navigate to "Edit" and then "Notebook Settings". Select "GPU" and then click "Save".

To proceed, you'll need to install two libraries: LangChain and llama-cpp-python, the Python bindings for llama.cpp. When running this notebook locally, you only need to install these libraries once, and they'll remain on your computer. In Colab, however, they're not installed by default and must be installed for each session.

**LangChain** is a framework that simplifies the development of applications powered by large language models (LLMs).

**llama.cpp** enables us to execute quantised versions of models.

> Quantisation of LLMs is a process that reduces the precision of the numerical values in the model, such as converting 32-bit floating-point numbers to 8-bit integers. These models are therefore smaller and faster, allowing them to run on less powerful hardware with only a small loss in precision.
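
To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantisation in plain Python. It is an illustration only: the function names and the sample weights are made up for this example, and GGUF formats such as `Q4_K_M` use more elaborate block-wise schemes than this.

```python
def quantise_int8(weights):
    """Symmetric 8-bit quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale factor converts between the float and integer ranges.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [round(w / scale) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantised]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# The largest difference between an original weight and its restored value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("integers stored:", q)
print("restored floats:", restored)
print("largest rounding error:", max_error)
```

Each weight now fits in one byte instead of four, and the rounding error stays within half a quantisation step (`scale / 2`).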

* If you're using a **CPU**, use the [standard installation](https://python.langchain.com/docs/integrations/llms/llamacpp#cpu-only-installation) of llama.cpp. Windows users might have to install [a couple of extra libraries too](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-windows). Some students using Windows have also found [this guide](https://medium.com/@piyushbatra1999/installing-llama-cpp-python-with-nvidia-gpu-acceleration-on-windows-a-short-guide-0dfac475002d) useful.
* If you have an **NVIDIA GPU**, you need to [activate cuBLAS](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-openblas-cublas-clblast) with llama.cpp. cuBLAS is a library that speeds up operations on NVIDIA GPUs.
* If you have a **Mac with an Apple silicon chip**, you need to [enable Metal](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-metal) to use its GPU.

In [None]:
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

Before we dive into the examples, let's download the large language model (LLM) we'll be using. For these exercises, we've selected a [quantised version of Mistral AI's Mistral 7B model](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF). While this is a great choice, it's by no means the only option. We encourage you to explore and try different models to discover the unique strengths and weaknesses of each. Even models of similar size can exhibit surprisingly different capabilities.

> If you're working in Colab, you'll need to download the LLM for each session.
<br>
If you're working locally, you only need to download the model once. It then stays on your computer and doesn't need to be downloaded again. Change `--local-dir` to your folder of choice.

In [None]:
pip install huggingface-hub
#[run it in the terminal]

In [None]:
#run it in the terminal
!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
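
If you're working locally, a quick check like the one below can save you from downloading several gigabytes twice. It's only a sketch: `model_is_downloaded` is a hypothetical helper, and it assumes the file sits in the folder you passed to `--local-dir` (the current folder in the command above).

```python
from pathlib import Path

# File name taken from the download command above.
MODEL_FILE = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"


def model_is_downloaded(filename, local_dir="."):
    """Return True if a non-empty file called `filename` exists in `local_dir`."""
    path = Path(local_dir) / filename
    return path.is_file() and path.stat().st_size > 0


if model_is_downloaded(MODEL_FILE):
    print("Model found, no need to download it again.")
else:
    print("Model not found, run the download command first.")
```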

---
## 2.&nbsp; Setting up your LLM 🧠

LangChain simplifies LLM deployment with its streamlined setup process. A single call configures your LLM, allowing you to tailor the parameters to your specific needs.

If you want to know more about llama-cpp-python, you can [read the docs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/). Alternatively, here are the [LangChain docs for Llama.cpp](https://python.langchain.com/docs/integrations/llms/llamacpp).

Here's a brief overview of some of the parameters:
* **model_path:** The path to the model file that will be used for generating text.
* **max_tokens:** The maximum number of tokens that the model should generate in its response.
* **temperature:** A value, typically between 0 and 1, that controls the randomness of the model's generation. A lower temperature results in more predictable, constrained output, while a higher temperature yields more creative and diverse text.
* **top_p:** A value between 0 and 1 that controls the diversity of the model's predictions. The model samples only from the most probable tokens whose combined probability reaches top_p. A lower top_p value restricts the choice to the most probable tokens, while a higher top_p value lets the model explore a wider range of possibilities.
* **n_gpu_layers:** The number of model layers to run on the GPU. The default setting of 0 runs all layers on the CPU. Setting n_gpu_layers to 1 runs one layer on the GPU and the remaining layers on the CPU, setting it to 2 runs two layers on the GPU, and so on. -1 offloads all layers to the GPU. In general, it is a good idea to experiment with different values of n_gpu_layers to find the best balance between performance and memory usage for your specific application.
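
The effect of `temperature` and `top_p` is easier to see on a toy example. The sketch below is plain Python, not llama.cpp's actual sampler: the four logits are invented, and the two helper functions are hypothetical names used only here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their combined probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalise so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


# Invented scores for four candidate next tokens.
logits = [2.0, 1.0, 0.5, -1.0]

cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)
narrow = top_p_filter(warm, 0.5)
wide = top_p_filter(warm, 1.0)

print("temperature 0.1:", [round(p, 4) for p in cold])
print("temperature 1.0:", [round(p, 4) for p in warm])
print("tokens kept with top_p 0.5:", sorted(narrow))
print("tokens kept with top_p 1.0:", sorted(wide))
```

With these numbers, the low temperature puts almost all of the probability on the first token. A `top_p` of 0.5 keeps only that token, while a `top_p` of 1.0 keeps all four candidates.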

In [1]:
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path = "/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
               max_tokens = 2000,
               temperature = 0.1,
               top_p = 1,
               n_gpu_layers = -1)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32           

If you're using a GPU, check the output of this cell ☝️
  * If you're using cuBLAS, you'll see `BLAS = 1` if it's installed correctly.
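  * If you're using Metal, you'll see lines starting with `ggml_metal_init` in the log if it's installed correctly. `NEON = 1` on its own only reports ARM CPU SIMD support, which Apple silicon shows with or without Metal.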

---
## 3.&nbsp; Asking your LLM questions 🤖
Play around and note how small changes make a big difference.

In [2]:
answer_1 = llm.invoke("Which animals live at the north pole?")
print(answer_1)


llama_print_timings:        load time =    6205.29 ms
llama_print_timings:      sample time =     106.10 ms /   503 runs   (    0.21 ms per token,  4740.86 tokens per second)
llama_print_timings: prompt eval time =    6205.24 ms /     8 tokens (  775.65 ms per token,     1.29 tokens per second)
llama_print_timings:        eval time =   27044.97 ms /   503 runs   (   53.77 ms per token,    18.60 tokens per second)
llama_print_timings:       total time =   34857.13 ms /   511 tokens




1. Polar Bears
2. Arctic Foxes
3. Walruses
4. Caribou
5. Beluga Whales
6. Narwhals
7. Seals
8. Musk Oxen
9. Arctic Hares
10. Snowy Owls
11. Reindeer
12. Beavers
13. Moose
14. Lynx
15. Wolverines
16. Caribou
17. Arctic Hares
18. Beluga Whales
19. Narwhals
20. Seals
21. Musk Oxen
22. Arctic Foxes
23. Polar Bears
24. Walruses
25. Caribou
26. Beluga Whales
27. Narwhals
28. Seals
29. Musk Oxen
30. Arctic Hares
31. Snowy Owls
32. Reindeer
33. Beavers
34. Moose
35. Lynx
36. Wolverines
37. Caribou
38. Arctic Hares
39. Beluga Whales
40. Narwhals
41. Seals
42. Musk Oxen
43. Arctic Foxes
44. Polar Bears
45. Walruses
46. Caribou
47. Beluga Whales
48. Narwhals
49. Seals
50. Musk Oxen
51. Arctic Hares
52. Snowy Owls
53. Reindeer
54. Beavers
55. Moose
56. Lynx
57. Wolverines
58. Caribou
59. Arctic Hares
60. Beluga Whales
61. Narwhals
62. Seals
63. Musk Oxen
64. Arctic Foxes
65. Polar Bears
66. Walruses
67. Caribou
68. Beluga Whales
69. Narwhals
70. Seals
71. Musk Oxen


In [3]:
answer_2 = llm.invoke("what do you mean by Generative AI")
print(answer_2)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    6205.29 ms
llama_print_timings:      sample time =      31.56 ms /   114 runs   (    0.28 ms per token,  3611.71 tokens per second)
llama_print_timings: prompt eval time =     312.07 ms /     8 tokens (   39.01 ms per token,    25.63 tokens per second)
llama_print_timings:        eval time =    6037.91 ms /   113 runs   (   53.43 ms per token,    18.72 tokens per second)
llama_print_timings:       total time =    6805.22 ms /   121 tokens


?
A: Generative AI refers to a type of artificial intelligence that can create new content, ideas or solutions based on existing data. It uses machine learning algorithms to analyze and understand patterns in the data, and then generates new outputs that are similar but not identical to the original inputs. This can include things like generating text, images, music, videos, and even entire websites or applications. Generative AI is often used in creative industries such as advertising, marketing, entertainment, and design, where it can help automate repetitive tasks and generate new ideas.


In [4]:
answer_3 = llm.invoke("Explain the huggingface-cli like I'm 5 years old.")
print(answer_3)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    6205.29 ms
llama_print_timings:      sample time =      28.49 ms /    73 runs   (    0.39 ms per token,  2562.57 tokens per second)
llama_print_timings: prompt eval time =     538.94 ms /    16 tokens (   33.68 ms per token,    29.69 tokens per second)
llama_print_timings:        eval time =    3904.04 ms /    73 runs   (   53.48 ms per token,    18.70 tokens per second)
llama_print_timings:       total time =    4857.62 ms /    89 tokens



Hugging Face is a company that makes software for computers to understand human language. They have a special tool called "huggingface-cli" that helps people use their software on their own computers. It's like having your own personal assistant that can help you with things like translating words from one language to another or finding information online.


The answers provided by the 7B model may not seem as impressive as those from the latest OpenAI or Google models, but given how much smaller the model is, it performs very well. These models may not have the most extensive knowledge base, but for our purposes, we only need them to generate coherent English. We'll then infuse them with specialised knowledge on a topic of your choice, resulting in a local, specialised model that can function offline.

---
## 4.&nbsp; Challenge 😀
Play around with this, and other, LLMs. Keep a record of your findings:
1. Pose different questions to the model, each subtly different from the last. Observe the resulting outputs. Smaller models tend to be highly sensitive to minor changes in language and grammar.
2. Experiment with the parameters, one at a time, to assess their impact on the output.
3. Attempt to load different models: Explore the [models page on Hugging Face](https://huggingface.co/models). You can use the left-hand menu to find `Text Generation` under `Natural Language Processing`. Then use the filter bar for `GGUF` to find already quantised models.
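
One way to keep that record is sketched below. The helper names are hypothetical, and `EchoLLM` is a stand-in so the cell runs without a model. In practice you would pass the `llm` object created earlier, since the only thing the sketch relies on is its `invoke` method.

```python
def compare_prompts(llm, prompts):
    """Ask each prompt in turn and return a list of records for later review."""
    records = []
    for prompt in prompts:
        answer = llm.invoke(prompt)
        records.append({"prompt": prompt,
                        "answer": answer,
                        "answer_length": len(answer)})
    return records


def one_at_a_time(base_params, variations):
    """Yield copies of base_params that each differ from it in exactly one parameter."""
    for name, values in variations.items():
        for value in values:
            if value != base_params.get(name):
                yield {**base_params, name: value}


class EchoLLM:
    """Stand-in so the sketch runs without a model; swap in your own `llm`."""

    def invoke(self, prompt):
        return f"(answer to: {prompt})"


# Two prompts that differ only in capitalisation.
records = compare_prompts(EchoLLM(), [
    "Which animals live at the north pole?",
    "Which animals live at the North Pole?",
])
for record in records:
    print(record["prompt"], "->", record["answer_length"], "characters")

# Parameter settings that change one value at a time from the notebook's defaults.
base = {"max_tokens": 2000, "temperature": 0.1, "top_p": 1}
settings = list(one_at_a_time(base, {"temperature": [0.1, 0.5, 0.9],
                                     "top_p": [0.5, 1]}))
for setting in settings:
    print(setting)
```

Each dictionary in `settings` could be unpacked into the `LlamaCpp(...)` call alongside `model_path`. Bear in mind that this reloads the model every time, which is slow.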

You can alter the download command accordingly. In this notebook we used the command:

In [None]:
# !huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

This downloads the version `mistral-7b-instruct-v0.1.Q4_K_M.gguf` of the model `TheBloke/Mistral-7B-Instruct-v0.1-GGUF` from Hugging Face. You can read about the different versions on the model's `model card`.

To adapt this, just change the model and the version to your new choice.

`!huggingface-cli download {model_name} {model_version} --local-dir . --local-dir-use-symlinks False`

For example:

In [None]:
# !huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
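
If you plan to try several models, the template can also be filled in with a small helper. This is just a convenience sketch: `build_download_command` is a hypothetical function that assembles the same command string shown above, and it does not download anything itself.

```python
import shlex


def build_download_command(model_name, model_version, local_dir="."):
    """Assemble the huggingface-cli download command for one model file."""
    parts = ["huggingface-cli", "download", model_name, model_version,
             "--local-dir", local_dir,
             "--local-dir-use-symlinks", "False"]
    # shlex.quote protects folder names that contain spaces.
    return " ".join(shlex.quote(part) for part in parts)


command = build_download_command("TheBloke/Llama-2-7B-Chat-GGUF",
                                 "llama-2-7b-chat.Q4_K_M.gguf")
print(command)
```

You can paste the printed command into a terminal, or prefix it with `!` in a notebook cell.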

In [5]:
#llama-2-7b-chat.Q4_K_M.gguf
from langchain.llms import LlamaCpp

llm1 = LlamaCpp(model_path = "/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/llama-2-7b-chat.Q4_K_M.gguf",
               max_tokens = 2000,
               temperature = 0.1,
               top_p = 1,
               n_gpu_layers = -1)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   

In [6]:
answer_4 = llm1.invoke('how can I get a data analyst job?')
print(answer_4)


llama_print_timings:        load time =    5573.47 ms
llama_print_timings:      sample time =     112.51 ms /   501 runs   (    0.22 ms per token,  4453.02 tokens per second)
llama_print_timings: prompt eval time =    5783.98 ms /    11 tokens (  525.82 ms per token,     1.90 tokens per second)
llama_print_timings:        eval time =   26561.44 ms /   500 runs   (   53.12 ms per token,    18.82 tokens per second)
llama_print_timings:       total time =   33938.72 ms /   511 tokens



I am interested in pursuing a career as a data analyst. How do I go about getting a job in this field? What are some of the key skills and qualifications that employers typically look for in a data analyst candidate?
Answer: Pursuing a career as a data analyst can be an exciting and rewarding choice, as organizations across various industries increasingly rely on data-driven decision making. Here are some steps you can take to increase your chances of landing a job as a data analyst:
1. Build a strong foundation in statistics and mathematics: Data analysis involves working with large datasets, identifying patterns, and drawing meaningful insights. Therefore, it is essential to have a good grasp of statistical concepts such as regression analysis, hypothesis testing, and time series analysis. Additionally, proficiency in mathematical concepts like linear algebra and calculus can be helpful.
2. Learn data visualization tools: Data analysts use various visualization tools to present thei

The download command in this example fetches a [quantised version of Meta's Llama 2 7B chat](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main), which the cells above then load as `llm1` and query.








