<a href="https://colab.research.google.com/github/michalis0/DataScience_and_MachineLearning/blob/master/10-gen-ai/Week_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain
!pip install wikipedia
!pip install chromadb
!pip install tiktoken
!pip install pypdf
!pip install faiss-cpu


In [None]:
!pip install openai==0.28.1

In [3]:
# import libraries
import pandas as pd
import numpy as np
import wikipedia
import urllib.request
import bs4 as bs
import os

#import sklearn
from sklearn.model_selection import train_test_split


# import Langchain
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.schema import HumanMessage
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import WebBaseLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.callbacks import get_openai_callback
from langchain.chains import create_tagging_chain, create_tagging_chain_pydantic
from langchain.vectorstores import FAISS

# Generative AI

<img src='https://images.unsplash.com/photo-1686191568035-db49125b65c6?auto=format&fit=crop&q=80&w=2940&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D' width="450">

Credit: [Mojahid Mottakin](https://unsplash.com/@iammottakin)

## Content

The goal of this walkthrough is to give you a first look at [Generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligence). Generative AI refers to artificial intelligence systems that can create new and original content, such as text, images, or music, from a short instruction (a prompt).

We will first look at some applications of GenAI to text. We will also see that using GenAI can be costly, and finally we will discuss fine-tuning a model.

- [First steps with LangChain and OpenAI](#First-steps)
    - [Calculate the cost](#Cost)
- [Text Summarization](#Text-Summarization)
- [Sentiment analysis](#Sentiment-analysis)
- [Text embedding](#Text-embedding)
- [Fine-tuning](#Fine-tuning)
- [Your turn!](#exercices)
    - [Prompt](#prompting)
    - [Summarizing](#Summarizing)
    - [Sentiment analysis](#Sentiment-analysis-ex)

## First steps with LangChain and OpenAI
During this exercise session, we will be using the [OpenAI](https://platform.openai.com/docs/api-reference) API with different libraries.
First, we need to set up the API key and some of the libraries.

Due to recent changes in the OpenAI library, there are incompatibilities with the LangChain library. We will therefore use an earlier version, installed with this command:

```
!pip install openai==0.28.1
```
We will also use the [Langchain](https://python.langchain.com/docs/get_started/introduction) library for the prompting and the text applications.

First, we will define the API key. You can find out how to create an API key for your account in this document: [Guide-API-Key-OpenAi](https://github.com/michalis0/DataScience_and_MachineLearning/blob/master/10-gen-ai/Guide-API-Key-OpenAI.ipynb)

In [None]:
# define the openAI API key
os.environ['OPENAI_API_KEY'] = 'YOUR OPENAI API KEY'
print(os.getenv('OPENAI_API_KEY'))

To try the key, let's use ChatOpenAI like we would use ChatGPT. To do so, we will need to import the ChatOpenAI model:


```
from langchain.chat_models import ChatOpenAI
```



In [None]:
# define the chat and try it with a simple message
chat_model  = ChatOpenAI()
llm = OpenAI()

# print(llm.predict("Hi! How are you?"))  # same call with the completion model
print(chat_model.predict("Hi! How are you?"))

You can also call the model in other ways, as shown in the following two examples.

In [None]:
text = "What would be a good name for a startup founded by a student who just graduated with a degree in Information Systems and Digital Innovation?"

#print(llm.predict(text))
print(chat_model.predict(text))

In [None]:
text = "What would be a good name for a startup founded by a student who just graduated with a degree in Information Systems and Digital Innovation?"
messages = [HumanMessage(content=text)]

#print(llm.predict_messages(messages).content)
print(chat_model.predict_messages(messages).content)

With Langchain, you can define a reusable prompt with placeholders by using the ```PromptTemplate``` class, and fill in the placeholders when you need the final prompt.

In [None]:
prompt = PromptTemplate.from_template("What is a good name for a startup that works in {field}?")
prompt.format(field="Information Systems and Digital Innovation")
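
As a rough illustration of what formatting a template amounts to, the sketch below fills the same placeholder with Python's built-in `str.format`. It does not use LangChain and is not how ```PromptTemplate``` is implemented; the helper name `fill_template` is hypothetical.

```python
# Minimal sketch: fill a {field} placeholder with plain Python.
# `fill_template` is a hypothetical helper, not part of LangChain.

def fill_template(template: str, **variables: str) -> str:
    """Return the template with each {name} placeholder replaced."""
    return template.format(**variables)


template = "What is a good name for a startup that works in {field}?"
filled = fill_template(template, field="Information Systems and Digital Innovation")
print(filled)
```

The filled string is what would then be sent to the model as the prompt.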

### Calculate the cost
The OpenAI API is not free. The price depends on the model and on how many tokens we send to the API and receive back. You can find the pricing [here](https://openai.com/api/pricing/).

To get an idea of the costs, calculate the price of using the GPT-4 model to summarize a document of 86'000 tokens into a summary of 1'000 tokens.
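
The general shape of such an estimate can be sketched as follows. This is only an illustration with made-up numbers: the function name `estimate_cost` is hypothetical, the prices are placeholders rather than OpenAI's actual rates, and we assume prices are quoted per 1,000 tokens, which may differ from the current pricing page.

```python
# Sketch of a token-based cost estimate.
# The prices below are HYPOTHETICAL placeholders, not OpenAI's actual rates:
# replace them with the values from the pricing page for your model.

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost in dollars when input and output tokens are billed per 1,000 tokens."""
    input_cost = (input_tokens / 1000) * price_in_per_1k
    output_cost = (output_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost


# Example with made-up numbers: 10,000 input tokens and 500 output tokens
example_cost = estimate_cost(10_000, 500, price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f"Estimated cost: ${example_cost:.3f}")
```

Input and output tokens are kept separate because they are usually priced differently.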

In [None]:
# Your turn!

## Text Summarization
Generative AI can be very useful for summarizing text. We will continue to use [Langchain](https://python.langchain.com/docs/use_cases/summarization) for our implementation and summarize the Wikipedia page on [*Information System*](https://en.wikipedia.org/wiki/Information_system). We first load the text, then the model [gpt-3.5-turbo](https://platform.openai.com/docs/models/gpt-3-5#gpt-3-5-turbo), which accepts up to 16k tokens of context, and finally the summarization chain.

In [20]:
# load the text
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Information_system")
docs = loader.load()

# load the model and the chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
chain = load_summarize_chain(llm, chain_type="stuff")

### Calculate the cost
Using the OpenAI API is not free. You can track the tokens used and the cost of a call like this:


In [None]:
#get the costs
with get_openai_callback() as cb:
  result = chain.run(docs)
  print(cb)
  display(result)

## Sentiment analysis

With Langchain, we can also classify text. You give the model a set of categories and it tries to assign the sentence to one of them. For example, we will perform a sentiment analysis and, at the same time, try to recognize the language in which the sentence is written.


In [23]:
# Schema
schema = {
    "properties": {
        "sentiment": {
            "type": "string",
            "enum":["positive", "neutral", "negative"],
        },
        "language": {
            "type": "string",
            "enum": ["spanish", "english", "french", "german", "italian"],
        },
    },
    "required": ["sentiment", "language"],
}

# LLM
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo") 
chain = create_tagging_chain(schema, llm)

In [None]:
# Test
sentence = "I love dogs!"  # avoid shadowing the built-in input()

chain.run(sentence)
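
Since the model's answer is free text turned into structured data, it can be worth checking the result against the schema. The sketch below is one possible check, assuming the chain returns a dictionary with one key per schema property; `validate_tags` and the two sample results are hypothetical and do not call the API.

```python
# Sketch: check that a tagging result respects the schema defined above.
# `validate_tags` and the sample results are hypothetical, for illustration only.

schema = {
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "language": {"type": "string", "enum": ["spanish", "english", "french", "german", "italian"]},
    },
    "required": ["sentiment", "language"],
}


def validate_tags(result: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the result is valid."""
    problems = []
    # every required key must be present
    for key in schema["required"]:
        if key not in result:
            problems.append(f"missing key: {key}")
    # every value must be one of the allowed categories
    for key, value in result.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected key: {key}")
        elif value not in spec["enum"]:
            problems.append(f"invalid value for {key}: {value}")
    return problems


good = {"sentiment": "positive", "language": "english"}
bad = {"sentiment": "happy"}
print(validate_tags(good, schema))
print(validate_tags(bad, schema))
```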

## Text embedding
LangChain is also useful for text embedding, that is, turning a text into a vector of numbers. You can see the different models [here](https://python.langchain.com/docs/integrations/text_embedding/). Have a look at the natural language processing class if you don't remember why text embedding is important.

You can either do the text embedding for a list of texts or a single piece of text.

In [25]:
embeddings = OpenAIEmbeddings()

In [None]:
# embedding a single query
text = "This is a test document."
query_result = embeddings.embed_query(text)
print(query_result[:5])   # first five components of the vector
print(len(query_result))  # dimension of the embedding

In [None]:
# embedding a list of texts
embedds = embeddings.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embedds), len(embedds[0])
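
Embeddings are typically compared with cosine similarity: texts with similar meaning should get vectors that point in a similar direction. The sketch below shows the computation on toy 3-dimensional vectors that stand in for real embeddings (the vectors and their names are made up, and no API call is made).

```python
import math

# Sketch: cosine similarity between two vectors, using only the standard library.
# The toy vectors below are invented stand-ins for real embeddings.

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


greeting_a = [1.0, 2.0, 0.0]
greeting_b = [2.0, 4.0, 0.0]   # same direction as greeting_a
question = [0.0, 0.0, 3.0]     # orthogonal to both greetings

print(cosine_similarity(greeting_a, greeting_b))
print(cosine_similarity(greeting_a, question))
```

With real embeddings you could apply the same function to two entries of the list returned by `embed_documents`.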

## Fine-tuning

Fine-tuning refers to the process of taking a pre-trained model (like the GPT models behind ChatGPT) and further training it on a specific task or dataset to improve its performance on that particular task. Instead of training a model from scratch, which can be computationally expensive and time-consuming, fine-tuning reuses the knowledge and features the model has already learned from a large and diverse dataset.

You can have a more detailed article [here](https://medium.com/@dataoilst.info/fine-tuning-langchain-llm-applications-a-technical-perspective-part-1-4b4c552ab557).

For now, we will not train a model. Instead, we will reuse the previous text, index it in a vector store, and retrieve the passages that are most relevant to a question.

In [None]:
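# split the document into chunks of about 1000 characters, without overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

# embed each chunk and index the vectors in a FAISS store
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)

# expose the index as a retriever for similarity search
retriever = db.as_retriever()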
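
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size windows. It is not how ```CharacterTextSplitter``` works internally (that class splits on a separator first); `chunk_text` is a hypothetical helper for illustration.

```python
# Sketch: fixed-size character chunks with optional overlap.
# `chunk_text` is a hypothetical helper, simpler than LangChain's splitters.

def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Cut text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

Overlap repeats the end of one chunk at the start of the next, which can help when a relevant sentence would otherwise be cut in two.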

In [None]:
# ask the question: retrieve the chunks most similar to the query
retrieved_docs = retriever.invoke(
    "What is an information system?"
)
print(retrieved_docs[0].page_content)

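We can see that the retriever simply returns the parts of the text where information systems are mentioned; it does not generate an answer. Therefore, on its own, it is not very useful. You can however go a bit further if you wish to integrate it better, for example by passing the retrieved passages to the chat model.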
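
The ranking behind this retrieval step can be sketched without any API call. The example below is a toy version that assumes word counts as stand-ins for embeddings and ranks chunks by cosine similarity to the query; the function names and sample chunks are hypothetical, and FAISS with OpenAI embeddings works on dense vectors instead.

```python
import math
from collections import Counter

# Sketch: similarity-based retrieval with bag-of-words vectors.
# Word counts replace real embeddings here; all names and data are made up.

def bag_of_words(text: str) -> Counter:
    """Represent a text as a count of its lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the query."""
    query_vector = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, bag_of_words(c)), reverse=True)
    return ranked[:k]


chunks = [
    "an information system collects and processes data",
    "the weather is sunny today",
    "students graduate in june",
]
top = retrieve("what is an information system", chunks, k=1)
print(top)
```

As in the notebook, the result is an existing passage, not a newly written answer.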

## Your turn!

### Prompt
Create a prompt that asks ChatGPT for some project ideas for a Data Science and Machine Learning class. Try to apply some rules of [prompt engineering](https://www.datacamp.com/tutorial/a-beginners-guide-to-chatgpt-prompt-engineering).

In [None]:
# Your code


### Summarizing

You will now try to summarize one of your lessons with the model ```gpt-3.5-turbo``` and print how much it cost. You might want to look at the class ```PyPDFLoader``` in order to load your PDF.



In [36]:
# Your code


### Sentiment analysis
Try to classify this sentence: ```J'aime l'intelligence artificielle.```

In [None]:
# Your code
