<a href="https://colab.research.google.com/github/ThomasEKolb/dighum-gpt-colab/blob/main/dighum_gpt_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Digital Humanism Summer School 2023: Hands-on Session ChatGPT


# Session 1

## 1. Install All Required Packages

In [None]:
%pip install langchain
%pip install pypdf
%pip install openai
%pip install tiktoken
%pip install chromadb
# PDFMinerLoader relies on pdfminer.six; the legacy pdfminer package installs the same module name, so it is omitted
%pip install pdfminer.six
%pip install wget
%pip install unstructured

## 2. Connect Your Personal Google Drive (optional)

You can attach your personal Google Drive with the following code (uncomment the two lines to run them):

In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')

## 3. Fill In Your OpenAI API Key

In [None]:
api_key = "" # @param {type:"string"}

## 4. Initialize the OpenAI Connection

In [None]:
import os
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

os.environ['OPENAI_API_KEY'] = api_key
llm = OpenAI()
chat_model = ChatOpenAI()
chat_model.predict("hi!")

'Hello! How can I assist you today?'

## 5. Parse the Book PDF into Text

In [None]:
# from langchain.document_loaders import PyPDFLoader
# from langchain.document_loaders import UnstructuredPDFLoader
from langchain.document_loaders import PDFMinerLoader
import wget

url_book = "https://owncloud.tuwien.ac.at/index.php/s/FW7Y2GNUOaUtrhf/download" # book as pdf
book_filename = wget.download(url_book)
print(book_filename)

# loader = PyPDFLoader("978-3-030-86144-5.pdf")
# loader = UnstructuredPDFLoader("978-3-030-86144-5.pdf")
loader = PDFMinerLoader(book_filename)  # use the name returned by wget, which may differ if the file already exists
book = loader.load()

978-3-030-86144-5.pdf


In [None]:
print(f'You have {len(book)} book(s) in your data')
print(f'There are {len(book[0].page_content)} characters in your book')

You have 1 book(s) in your data
There are 852331 characters in your book


## 6. Split the Text into Smaller Chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(book)

In [None]:
print(f'The book is now divided into {len(texts)} parts.')

The book is now divided into 1182 parts.
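
The book has 852331 characters, so cutting it at exactly 1000 characters would give about 853 chunks. The splitter reports 1182, most likely because it prefers to break at paragraph, line, and word boundaries, which leaves many chunks shorter than the limit. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not LangChain's implementation, and `split_on_whitespace` is a hypothetical helper that only breaks at word boundaries.

```python
def split_on_whitespace(text: str, chunk_size: int) -> list[str]:
    """Greedily pack words into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut hard.
        while len(word) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:chunk_size])
            word = word[chunk_size:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The next word does not fit, so close the current chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "Digital humanism puts people at the centre of technology design."
chunks = split_on_whitespace(sample, chunk_size=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

No chunk exceeds the limit, but most end up shorter than it, which is why the chunk count is higher than the character count divided by the chunk size.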


## 7. Create Embeddings Based on the Created Chunks

In [None]:
from langchain.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings(chunk_size=1,deployment='text-embedding-ada-002')

In [None]:
# Alternative: compute the embeddings yourself (this calls the OpenAI API for every chunk) instead of downloading them in the next cell.
# from langchain.vectorstores import Chroma
# persist_directory = './db' # or store it directly into your personal drive: '/content/drive/MyDrive/db'
# croma_db = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
# croma_db.persist() # store embeddings

In [None]:
from langchain.vectorstores import Chroma
persist_directory = './db'

# download the embeddings (to save some time and cost)
url_embeddings = "https://owncloud.tuwien.ac.at/index.php/s/xwFts1mQo3VXiCg/download" # embeddings of the book
embeddings_filename = wget.download(url_embeddings)

import zipfile
with zipfile.ZipFile('./'+embeddings_filename, 'r') as zip_ref:
    zip_ref.extractall('./')

print(embeddings_filename)

# load data from vector store
croma_db = Chroma(persist_directory=persist_directory, embedding_function=embedding)

db-20230831T161118Z-001.zip
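
The vector store holds one embedding vector per chunk. At query time the question is embedded as well, and the chunks whose vectors lie closest to it are returned. The sketch below shows that lookup with invented three-dimensional vectors and cosine similarity. The vectors, the chunk labels, and the helper names are assumptions for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions, and the distance metric a store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy store: (chunk text, made-up embedding)
toy_store = [
    ("chunk about ethics", [0.9, 0.1, 0.0]),
    ("chunk about privacy", [0.7, 0.6, 0.1]),
    ("chunk about the schedule", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]  # made-up embedding of a question about ethics

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```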


## 8. Use the LangChain Framework to Provide a Text-Based Search Interface

In [None]:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=croma_db.as_retriever(),
                                 return_source_documents=False
                                 )
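
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question, and the model is called once. The sketch below shows the general shape of such a prompt. The template wording and the `build_stuff_prompt` helper are hypothetical and are not the text LangChain sends.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate all retrieved chunks and the question into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_prompt = build_stuff_prompt(
    "What is digital humanism?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(example_prompt)
```

Because every retrieved chunk ends up in the same prompt, the chunk size and the number of retrieved chunks together have to fit into the model's context window.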

## Fill in your questions about the book

In [None]:
query = "Explain digital humanism to a child of 6 years." # @param {type:"string"}
qa.run(query)

' Digital humanism is a way of using technology to make the world better for people and animals and to protect the environment for future generations. It encourages people to make their own decisions and be in control of the technology, rather than letting machines make decisions for them.'

# Session 2

## 1. Incorporate Various Heterogeneous Data Sources

In [None]:
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.document_loaders import TextLoader

import wget

url_program = "https://owncloud.tuwien.ac.at/index.php/s/hX8wLPxv6rXAdgM/download" #https://caiml.dbai.tuwien.ac.at/dighum/summerschool2023/program/
url_manifesto = "https://owncloud.tuwien.ac.at/index.php/s/h2457gToxxlgnpN/download" #https://caiml.dbai.tuwien.ac.at/dighum/dighum-manifesto/
program_filename = wget.download(url_program)
manifesto_filename = wget.download(url_manifesto)
print(program_filename,manifesto_filename)

html_loader = UnstructuredHTMLLoader(program_filename)
txt_loader = TextLoader(manifesto_filename)
program = html_loader.load()
manifesto = txt_loader.load()

dighum_program (1).html DigHum_Manifesto (1).txt


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
print(f'There are {len(program[0].page_content)} characters on the webpage')
print(f'There are {len(manifesto[0].page_content)} characters in the manifesto')

There are 10955 characters on the webpage
There are 7871 characters in the manifesto


In [None]:
text_splitter2 = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

program_texts = text_splitter2.split_documents(program)
manifesto_texts = text_splitter2.split_documents(manifesto)
combined_texts = program_texts + manifesto_texts

print(f'The texts are now divided into {len(combined_texts)} parts.')

The texts are now divided into 28 parts.


In [None]:
from langchain.vectorstores import Chroma
persist_directory2 = './db2' # or store it directly into your personal drive: '/content/drive/MyDrive/db2'
croma_db2 = Chroma.from_documents(documents=combined_texts, embedding=embedding, persist_directory=persist_directory2)
croma_db2.persist() # store embeddings

In [None]:
qa2 = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=croma_db2.as_retriever(),
                                 return_source_documents=False
                                 )

In [None]:
query = "Show me the schedule of the summer school for Monday morning." # @param {type:"string"}
qa2.run(query)

' Monday, September 4, 2023\n8:30 - 9:00 Registration\nMorning (9:00-12:30)\n9:00 - 9:30 Opening and Welcome\n9:30 - 9:45 Welcome Address by Dean Gerti Kappel\n9:45 - 10:00 Welcome Address by\xa0Enrico Nardelli\xa0(ACM Europe)\n10:00 - 11:00\xa0Hannes Werthner: Introduction to Digital Humanism\n11:30 - 12:30\xa0George Metakides: Digital Enlightenment'

In [None]:
query = "Who are the authors of the DigHum Manifesto?" # @param {type:"string"}
qa2.run(query)

' The authors of the DigHum Manifesto are scientists and practitioners from across fields and topics, brought together by concerns and hopes for the future.'

## 2. Prompt Engineering

In [None]:
from langchain import PromptTemplate
summary_template = PromptTemplate.from_template("Please give me a concise summary of the book chapter {chapter} by {author}.")

In [None]:
qa.run(summary_template.format(chapter="Are We Losing Control?",author="Edward A. Lee"))

' The book chapter Are We Losing Control? by Edward A. Lee suggests that humanity never had full control over technology, but that it is possible to nudge the process in a more humane direction through a more human-centric approach. Intellectuals from all disciplines, technologists with a deeper understanding of the humanities, and policy makers must work together to achieve this goal.'
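
A prompt template is a string with named placeholders in curly braces that are filled in before the text is sent to the model. The same mechanism can be reproduced with Python's built-in string formatting, as in the sketch below. The template text and the `placeholders` helper are hypothetical examples and are not part of LangChain.

```python
from string import Formatter


def placeholders(template: str) -> list[str]:
    """List the named placeholders that appear in a format string."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


template_text = "List three key arguments of the chapter {chapter} by {author}."

# Inspect which variables the template expects before filling it in.
print(placeholders(template_text))

filled = template_text.format(chapter="Are We Losing Control?", author="Edward A. Lee")
print(filled)
```

Listing the placeholders first makes it easy to spot a misspelled or missing variable before a request is sent.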

In [None]:
recommendation_template = PromptTemplate.from_template(
"I want you to act as a recommender system that recommends book chapters (along with their authors) that are related to the following topic: {topic}")

In [None]:
qa.run(recommendation_template.format(topic="Artificial Intelligence"))

' Chapter 3, "The Attention Economy and the Impact of Artificial Intelligence," by Lynda Hardman.'