## Objectives

After completing this lab, you will be able to:

- Gain an understanding of generative AI and its impact across various domains.
- Familiarize yourself with different types of models in generative AI.
- Acquire the skills to build and interact with a chatbot using transformers.


## What is generative AI?
Imagine presenting a computer with a vast array of paintings. After analyzing them, it tries to craft a unique painting of its own. This capability is termed generative AI. Essentially, the computer derives inspiration from the provided content and uses it to create something new.

## Real-world impact of generative AI
Generative AI is transforming multiple industries. Its applications span:

### 1. Art and creativity
- Generative art: Artists employing generative AI algorithms can create stunning artworks by learning from existing masterpieces and producing unique pieces inspired by them. These AI-generated artworks have gained recognition in the art world.
- Music Composition: Projects in the realm of generative AI have been employed to compose music. They learn from a vast data set of musical compositions and can generate original pieces in various styles, from classical to jazz, revolutionizing the music industry.

### 2. Natural language processing (NLP)
- Content generation: Tools like generative pre-trained transformer (GPT) have demonstrated their ability to generate coherent and context-aware text. They can assist content creators by generating articles, stories, or marketing copy, making them valuable tools in content creation.
- Chatbots and virtual assistants: Generative AI powers many of today's chatbots and virtual assistants. These AI-driven conversational agents understand and generate human-like responses, enhancing user experiences.
- Code Writing: Generative AI models can also produce code snippets based on descriptions or requirements, streamlining software development.

### 3. Computer vision
- Image synthesis: Models like data analysis learning with language model for generation and exploration, frequencly known as DALL-E, can generate images from textual descriptions. This technology finds applications in graphic design, advertising, and creating visual content for marketing.
- Deepfake detection: With the advancement in generative AI techniques, the generation of deep fake content is also on the rise. Consequently, generative AI now plays a role in developing tools and techniques to detect and combat the spread of misinformation through manipulated videos.

### 4. Virtual avatars
- Entertainment: Generative AI is utilized to craft virtual avatars for gaming and entertainment. These avatars mimic human expressions and emotions, bolstering user engagement in virtual environments.
- Marketing: Virtual influencers, propelled by generative AI, are on the rise in digital marketing. Brands are harnessing these virtual personas to endorse their products and services.

## Neural structures behind generative AI
Before we had the powerful transformers, which are like super-fast readers and understand lots of words at once, there were other methods used for making computers generate text. These methods were like the building blocks that led to the amazing capabilities we have today.

## Large language models (LLMs)
Large language models are like supercharged brains. They are massive computer programs with lots of "neurons" that learn from huge amounts of text. These models are trained to do tasks like understanding and generating text, and they're used in many applications. However, there's a limitation: these models are not very good at understanding the bigger context or the meaning of words. They work well for simple predictions but struggle with more complex text.

## Text generation before transformers

### 1. N-gram language models
N-gram models are like language detectives. They predict what words come next in a sentence based on the words that came before. For example, if you say "The sky is," these models guess that the next word might be "blue."

### 2. Recurrent neural networks (RNN)
Recurrent neural networks (RNNs) are specially designed to handle sequential data, making them a powerful tool for applications like language modeling and time series forecasting. The essence of their design lies in maintaining a 'memory' or 'hidden state' throughout the sequence by employing loops. This enables RNNs to recognize and capture the temporal dependencies inherent in sequential data.
- Hidden state: Often referred to as the network's 'memory', the hidden state is a dynamic storage of information about previous sequence inputs. With each new input, this hidden state is updated, factoring in both the new input and its previous value.
- Temporal dependency: Loops in RNNs enable information transfer across sequence steps.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0J87EN/%E9%80%92%E5%BD%92%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E5%9B%BE.png" width="60%" height="60%"> 

<div style="text-align:center"><a href="https://commons.wikimedia.org/wiki/File:%E9%80%92%E5%BD%92%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E5%9B%BE.png">Image Source</a></div>

Illustration of RNN's operation: Consider a simple sequence, such as the sentence: "I love RNNs". The RNN interprets this sentence word by word. Beginning with the word "I", the RNN ingests it, generates an output, and updates its hidden state. Moving on to "love", the RNN processes it alongside the updated hidden state which already holds insights about the word "I". The hidden state is updated again post this. This pattern of processing and updating continues until the last word. By the end of the sequence, the hidden state ideally encapsulates insights from the entire sentence.
                                                                                                       
### 3. Long short-term memory (LSTM) and gated recurrent units (GRUs)
Long short-term memory (LSTM) and gated recurrent units (GRUs) are advanced variations of recurrent neural networks (RNNs), designed to address the limitations of traditional RNNs and enhance their ability to model sequential data effectively. They processed sequences one element at a time and maintained an internal state to remember past elements. While they were effective for a variety of tasks, they struggled with long sequences and long-term dependencies.

### 4. Seq2seq models with attention
- Sequence-to-sequence (seq2seq) models, often built with RNNs or LSTMs, were designed to handle tasks like translation where an input sequence is transformed into an output sequence.
- The attention mechanism was introduced to allow the model to "focus" on relevant parts of the input sequence when generating the output, significantly improving performance on tasks like machine translation.

While these methods provided significant advancements in text generation tasks, the introduction of transformers led to a paradigm shift. Transformers, with their self-attention mechanism, proved to be highly efficient at capturing contextual information across long sequences, setting new benchmarks in various NLP tasks.

## Transformers
Proposed in a paper titled "Attention Is All You Need" by Vaswani et al. in 2017, the transformer architecture replaced sequential processing with parallel processing. The key component behind its success? The attention mechanism, more precisely, self-attention.

Key steps include:
- Tokenization: The first step is breaking down a sentence into tokens (words or subwords).
- Embedding: Each token is represented as a vector, capturing its meaning.
- Self-attention: The model computes scores determining the importance of every other word for a particular word in the sequence. These scores are used to weight the input tokens and produce a new representation of the sequence. For instance, in the sentence "He gave her a gift because she'd helped him", understanding who "her" refers to requires the model to pay attention to other words in the sentence. The transformer does this for every word, considering the entire context, which is particularly powerful for understanding meaning.
- Feed-forward neural networks: After attention, each position is passed through a feed-forward network separately.
- Output sequence: The model produces an output sequence, which can be used for various tasks, like classification, translation, or text generation.
- Layering: Importantly, transformers are deep models with multiple layers of attention and feed-forward networks, allowing them to learn complex patterns.

The architecture's flexibility has allowed transformers to be used beyond NLP, finding applications in image and video processing too. In NLP, transformer-based models like BERT, GPT, and their variants have set state-of-the-art results in various tasks, from text classification to translation.

### Implementation: Building a simple chatbot with transformers
Now, you will build a simple chatbot using `transformers` library from Hugging Face, which is an open-source natural language processing (NLP) toolkit with many useful features.
#### Step 1: Installing libraries


In [1]:
!pip install -qq tensorflow
!pip install transformers==4.42.1 -U
!pip install sentencepiece
!pip install torch == 2.2.2
!pip install torchtext==0.17.2
!pip install numpy==1.26


zsh:1: = not found
Collecting numpy==1.26
  Using cached numpy-1.26.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (53 kB)
Using cached numpy-1.26.0-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.0 which is incompatible.[0m[31m
[0mSuccessfully installed numpy-1.26.0


## ChatBot Making steps

1. Select tokenizer and Model
2. Create an instance of that model
3. Create an instance of that tokenizer
4. Create function:
    1. create user interface (input)
    2. inputs: tokenizer.encode(input)
    3. outputs: model.gererate(inputs)
    4. response: tokenizer.decode(outputs)

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'facebook/blenderbot-400M-distill'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # model

tokenizer = AutoTokenizer.from_pretrained(model_name) # Tokenizer
    

In [3]:
def chat_with_bot():
    while True:
        # Get user input
        input_text = input("You: ")

        # Exit conditions
        if input_text.lower() in ["quit","exit","bye"]:
            print("chatbot: Goodbye!")
            break

        # TOKENIZE input and generate response
        inputs = tokenizer.encode(input_text,return_tensors = 'pt')
        outputs = model.generate(inputs,max_new_tokens = 150)
        response = tokenizer.decode(outputs[0],skip_special_tokens = True).strip()

        #display bot's response
        print("Chatbot:",response)


In [4]:
chat_with_bot()

You: bye
chatbot: Goodbye!


## Tokenizer

For this lab, you will be using the following libraries:

* [`nltk`](https://www.nltk.org/) or natural language toolkit, will be employed for data management tasks. It offers comprehensive tools and resources for processing natural language text, making it a valuable choice for tasks such as text preprocessing and analysis.


* [`spaCy`](https://spacy.io/) is an open-source software library for advanced natural language processing in Python. spaCy is renowned for its speed and accuracy in processing large volumes of text data.


* [`BertTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#berttokenizer) is part of the Hugging Face Transformers library, a popular library for working with state-of-the-art pre-trained language models. BertTokenizer is specifically designed for tokenizing text according to the BERT model's specifications.


* [`XLNetTokenizer`](https://huggingface.co/docs/transformers/main_classes/tokenizer#xlnettokenizer) is another component of the Hugging Face Transformers library. It is tailored for tokenizing text in alignment with the XLNet model's requirements.


* [`torchtext`](https://pytorch.org/text/stable/index.html) It is part of the PyTorch ecosystem, to handle various natural language processing tasks. It  simplifies the process of working with text data and provides functionalities for data preprocessing, tokenization, vocabulary management, and batching.


In [5]:
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install scikit-learn


Collecting numpy>=1.19.0 (from spacy)
  Using cached numpy-2.2.5-cp312-cp312-macosx_14_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.2.5-cp312-cp312-macosx_14_0_arm64.whl (5.2 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.0
    Uninstalling numpy-1.26.0:
      Successfully uninstalled numpy-1.26.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.5 which is incompatible.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.2.5 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.5 which is incompatible.
tensorflow 2.19.0 requires numpy<2.2.0,>=1.26.0, but you have numpy 2.2.5 which is incompatible.
transformers 4.42.1 requires numpy<2.0,>=1.17, but you have numpy 2.2.5 which is i

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/__init__.py", line 6, in <module>
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "/opt/anaconda3/lib/python3.12/site-packages/thinc/__init__.py", line 5, in <module>
    from .config import registry
  File "/opt/anaconda3/lib/python3.12/site-packages/thinc/config.py", line 5, in <module>
    from .types import Decorator
  File "/opt/anaconda3/lib/python3.12/site-packages/thinc/types.py", line 27, in <module>
    from .compat import cupy, has_cupy
  File "/opt/anaconda3/lib/python3.12/site-packages/thinc/compat.py", li

In [6]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize

from transformers import BertTokenizer,XLNetTokenizer


from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tinonturjamajumder/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [7]:
text = 'I saw unicorns yesterday. I also see an unicorn today. They can fly'

from nltk import word_tokenize
tokenizer = word_tokenize(text)
print(tokenizer)

['I', 'saw', 'unicorns', 'yesterday', '.', 'I', 'also', 'see', 'an', 'unicorn', 'today', '.', 'They', 'can', 'fly']


In [8]:
# This showcases the use of the 'spaCy' 'tokenizer'
import spacy
!python -m spacy download en_core_web_sm

text = "I couldn't help the dog. Can't you do it? Don't be afraid if you are."

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

# Making a list of the tokens and printing the list
token_list = [token.text for token in doc]

for token in doc:
    print(token.text, token.pos_,token.dep_)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.5 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/__init__.py", line 6, in <module>
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
  File "/opt/anaconda3/lib/python3.12/site-packages/spacy/compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "/opt

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

In [9]:
!pip install jupyterthemes

Collecting numpy>=1.23 (from matplotlib>=1.4.3->jupyterthemes)
  Using cached numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)


Using cached numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.5
    Uninstalling numpy-2.2.5:
      Successfully uninstalled numpy-2.2.5
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.[0m[31m
[0mSuccessfully installed numpy-1.26.4


In [11]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("IBM taught me tokenization")

['ibm', 'taught', 'me', 'token', '##ization']

In [12]:
xltokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xltokenizer.tokenize("IBM taught me tokenization")

['▁IBM', '▁taught', '▁me', '▁token', 'ization']

In [67]:
## Tokenization with PyTorch
dataset= ([(1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1," NLP Named Entity,Sentiment Analysis,Machine Translation "),
    (1," Machine Translation with NLP "),
    (1," Named Entity vs Sentiment Analysis  NLP ")])

# get tokenizer

from torchtext.data.utils import get_tokenizer

first_sentence = dataset[0][1]

torch_tokenizer = get_tokenizer("basic_english")
torch_tokenizer(first_sentence)


['introduction', 'to', 'nlp']

## Token Indices

In [68]:
def yield_tokens(data_iter):
    for _,text in data_iter:
        yield torch_tokenizer(text)

In [73]:
iterat = yield_tokens(dataset)

In [61]:
for token in my_iterator:
    print(token)

## Out-of-Vocabulary

In [89]:
from torchtext.vocab import build_vocab_from_iterator

# vocab takes an iterator as input

vocab = build_vocab_from_iterator(iterator = my_iterator,specials = ["<unk>"])

vocab.set_default_index(vocab["<unk>"])

In [96]:
def get_tokenized_sentence_and_indice(iterator):
    
    tokenized_sentences,tokenized_indices = [],[]
    tokenized_sentence = next(iterator)
    token_indice = [vocab[token] for token in tokenized_sentence]

    tokenized_sentences.append(tokenized_sentence)
    tokenized_indices.append(token_indice)
        
    return tokenized_sentences,tokenized_indices
    
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(my_iterator,specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

dataset_iteration = yield_tokens(dataset)
tokenized_sentences,tokenized_indices = get_tokenized_sentence_and_indice(dataset_iteration)



TypeError: lookup_indices(): incompatible function arguments. The following argument types are supported:
    1. (self: torchtext._torchtext.Vocab, arg0: list) -> List[int]

Invoked with: <torchtext._torchtext.Vocab object at 0x150c3da70>, 'introduction'

In [None]:
tokenized_sentences

In [93]:
tokenized_indices

[[0, 0, 0]]