## Introduction

In this tutorial, we’ll walk through the process of interacting with a Google Cloud Storage (GCS) bucket named dauphine-bucket, specifically focusing on the data directory within the bucket. We’ll cover how to:

- List all files in the bucket’s data directory.
- Retrieve information about a specific file.
- Read files using the Unstructured library.
- Visualize the extracted documents with LangChain.

This guide is intended for users who are familiar with Python and basic cloud storage concepts.

## Prerequisites

Before we begin, ensure you have the following:

- Python 3.x installed on your system.
- Access to the GCP bucket dauphine-bucket/data with the necessary permissions.
- Google Cloud SDK installed and authenticated. You can authenticate by running `gcloud auth login`, as done in one of the cells below.

In [1]:
import os
import aiohttp
from dotenv import load_dotenv
from sqlalchemy.exc import ProgrammingError

from google.cloud import storage
from google.cloud.storage.bucket import Bucket

from langchain_core.documents.base import Document
from langchain_google_cloud_sql_pg import PostgresEngine, PostgresVectorStore
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_unstructured import UnstructuredLoader

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
!gcloud auth login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=QydxrDHW5SusgPNzsplgQe8IMJ03mR&access_type=offline&code_challenge=Lpcsly_6W0stkBu1zONsUFwNgyx7Su1ppodbxvko5Pk&code_challenge_method=S256


You are now logged in as [krimismr@gmail.com].
Your current project is [dauphine-437611].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


The following Python libraries must be installed:
- google-cloud-storage
- unstructured
- langchain

To learn more about these libraries, visit the following links:
- [google-cloud-storage](https://googleapis.dev/python/storage/latest/index.html)
- [unstructured](https://docs.unstructured.io/examplecode/codesamples/oss/vector-database)
- [langchain](https://langchain.readthedocs.io/en/latest/)


In [3]:
%pip install -q google-cloud-storage unstructured langchain langchain-unstructured python-dotenv python-magic sqlalchemy langchain_google_cloud_sql_pg
%pip install -q "unstructured[pptx]"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


# I. Listing and loading files from a GCS bucket

### I.1. Listing Files in the GCP Bucket

Explanation:

To interact with a GCS bucket, we’ll use the google-cloud-storage library. We’ll initialize a client, access the bucket, and list all the files within the data directory.

Code:

In [1]:
# Import the necessary library
from google.cloud import storage

# Initialize a client
client = storage.Client()

# Access the bucket
bucket_name = 'dauphine-bucket'
bucket = client.get_bucket(bucket_name)

# List all files in the 'data' directory
blobs = bucket.list_blobs(prefix='data/')

print("Files in 'dauphine-bucket/data':")
for blob in blobs:
    print(blob.name)

Files in 'dauphine-bucket/data':
data/
data/1 - Gen AI - Dauphine Tunis.pptx
data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx
data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx
data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx


Output Explanation:

Running this code will display all the file paths within the data directory of the bucket. The prefix='data/' parameter ensures we only get files from that specific directory.
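
The listing above also contains the bare `data/` entry. This is consistent with `prefix` acting as a plain string match on object names rather than as a real directory lookup. The sketch below emulates that matching on plain strings, without calling the GCS API. `match_prefix` is a hypothetical helper (not part of google-cloud-storage), and `database/notes.txt` is an invented object name used only for illustration.

```python
def match_prefix(names: list[str], prefix: str) -> list[str]:
    """Emulate prefix filtering: keep the names that start with `prefix`."""
    return [name for name in names if name.startswith(prefix)]


# Two names taken from the listing above, plus one hypothetical extra object.
object_names = [
    "data/",
    "data/1 - Gen AI - Dauphine Tunis.pptx",
    "database/notes.txt",  # hypothetical, for illustration only
]

print(match_prefix(object_names, "data/"))  # the placeholder and the file
print(match_prefix(object_names, "data"))   # also matches 'database/notes.txt'
```

Keeping the trailing slash in the prefix is what prevents unrelated names such as `database/...` from being matched.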

### I.2. Getting Information About One File


Explanation:

Sometimes, you may need detailed information about a specific file, such as its size, content type, or the last time it was updated. We’ll retrieve this metadata for a chosen file.


In [3]:
# Specify the file path (replace with an actual file from your bucket)
file_path = 'data/1 - Gen AI - Dauphine Tunis.pptx'

# Get the blob object
blob = bucket.get_blob(file_path)

if blob:
    print(f"Information for '{file_path}':")
    print(f"Size: {blob.size} bytes")
    print(f"Content Type: {blob.content_type}")
    print(f"Updated On: {blob.updated}")
    print(f"Blob name: {blob.name}")
else:
    print(f"File '{file_path}' not found in the bucket.")

Information for 'data/1 - Gen AI - Dauphine Tunis.pptx':
Size: 6724048 bytes
Content Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Updated On: 2024-10-07 09:52:30.256000+00:00
Blob name: data/1 - Gen AI - Dauphine Tunis.pptx


Output Explanation:

This code outputs metadata about the specified file. Make sure to set `file_path` to a path that actually exists in your bucket; if it does not, `get_blob` returns `None` and the "not found" message is printed.
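
When several files need to be inspected, it can be convenient to collect the same attributes into a dictionary. The following is a minimal sketch: `summarize_blob` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in carrying the values printed above, so the example runs without access to the bucket.

```python
from types import SimpleNamespace


def summarize_blob(blob) -> dict:
    """Collect a few metadata attributes of a blob-like object into a dict."""
    return {
        "name": blob.name,
        "content_type": blob.content_type,
        # Convert bytes to mebibytes for readability.
        "size_mib": round(blob.size / (1024 * 1024), 2),
    }


# Stand-in object with the values shown in the output above (not a real GCS blob).
fake_blob = SimpleNamespace(
    name="data/1 - Gen AI - Dauphine Tunis.pptx",
    content_type="application/vnd.openxmlformats-officedocument.presentationml.presentation",
    size=6724048,
)
print(summarize_blob(fake_blob))
```

The same function should work on the object returned by `bucket.get_blob(file_path)`, since it only reads the `name`, `content_type`, and `size` attributes used in the cell above.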

### I.3. Reading Files with Unstructured

Explanation:

The Unstructured library allows us to parse and process unstructured data from various file formats. We’ll download each file from the bucket and extract its content: in the code below, `.pptx` files are read with python-pptx, and every other format goes through the Unstructured loader.

In [9]:
import os
from google.cloud.storage.bucket import Bucket
from pptx import Presentation
from langchain_core.documents import Document

# Specify download directory
DOWNLOADED_LOCAL_DIRECTORY = os.path.abspath("./downloaded_files")
os.makedirs(DOWNLOADED_LOCAL_DIRECTORY, exist_ok=True)

def download_file_from_bucket(bucket: Bucket, file_path: str) -> str:
    # Download the file locally
    blob = bucket.blob(file_path)
    local_file_name = os.path.basename(file_path)
    local_filepath = os.path.join(DOWNLOADED_LOCAL_DIRECTORY, local_file_name)
    blob.download_to_filename(local_filepath)
    print(f"Downloaded '{file_path}' to '{local_filepath}'")
    return local_filepath

def read_pptx(filepath: str) -> str:
    prs = Presentation(filepath)
    text = ""
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                text += shape.text + "\n"
    return text

def read_file_from_local(local_filepath: str) -> list:
    if local_filepath.endswith(".pptx"):
        text = read_pptx(local_filepath)
        return [Document(page_content=text)]
    else:
        loader = UnstructuredLoader(local_filepath)
        return loader.load()

In [11]:
# Download and load all the files found under the 'data/' prefix
blobs = list(bucket.list_blobs(prefix='data/'))
documents: list[Document] = []
if blobs:
    for blob in blobs:
        try:
            local_filepath = download_file_from_bucket(bucket, blob.name)
            documents.extend(read_file_from_local(local_filepath))
        except Exception as e:
            print(f"An error occurred while processing '{blob.name}': {e}")
else:
    print("No files found in the 'data' directory.")

An error occurred while processing 'data/': [Errno 2] No such file or directory: 'c:\\Users\\USER\\Desktop\\GENERATIVE AI DAUPHINE\\GenAI-GCP\\TPs\\tp_4\\downloaded_files\\'
Downloaded 'data/1 - Gen AI - Dauphine Tunis.pptx' to 'c:\Users\USER\Desktop\GENERATIVE AI DAUPHINE\GenAI-GCP\TPs\tp_4\downloaded_files\1 - Gen AI - Dauphine Tunis.pptx'
Downloaded 'data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx' to 'c:\Users\USER\Desktop\GENERATIVE AI DAUPHINE\GenAI-GCP\TPs\tp_4\downloaded_files\2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx'
Downloaded 'data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx' to 'c:\Users\USER\Desktop\GENERATIVE AI DAUPHINE\GenAI-GCP\TPs\tp_4\downloaded_files\2.2  - Transformers - Gen AI - Dauphine Tunis.pptx'
Downloaded 'data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx' to 'c:\Users\USER\Desktop\GENERATIVE AI DAUPHINE\GenAI-GCP\TPs\tp_4\downloaded_files\3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pp
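
The first line of the log above reports an error for `data/`. That name has an empty base name, so the computed local path is the download directory itself rather than a file. One possible way to avoid the error is to skip such names before downloading. This is a sketch on plain strings; `is_directory_placeholder` is a hypothetical helper, not part of any library.

```python
import posixpath


def is_directory_placeholder(blob_name: str) -> bool:
    """Return True for names such as 'data/' that have no file name part."""
    # GCS object names use '/' as separator, hence posixpath.
    return posixpath.basename(blob_name) == ""


blob_names = ["data/", "data/1 - Gen AI - Dauphine Tunis.pptx"]
to_download = [name for name in blob_names if not is_directory_placeholder(name)]
print(to_download)
```

In the loading loop, the same check could be applied to `blob.name` before calling `download_file_from_bucket`.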

### I.4. Visualizing the First Documents Extracted with LangChain

Explanation:

LangChain is a framework for developing applications powered by language models. We’ll use it to load and visualize the documents extracted from the file.

In [12]:
for doc in documents[:3]:
    print(f"Content:\n{doc.page_content}\nMetadata:\n{doc.metadata}\n")

Content:
Generative AI with LLM
Florian Bastin
👨🏼‍🎓 Master MASH - Université PSL
👨🏼‍💻 LLM Engineer @OctoTechnology
Le Monde, Casino, Channel, Club Med, Pernod Ricard, Suez
‹#›
‹#›
I.A Pretraining Large Language Model

Pre training phase

A. Pretraining a Large Language Model
Introduction
Cross entropy loss
Tokenization
Evaluation
Data preprocessing
Scaling laws
Training process
Cost and optimization
Autoregressive language models:

The chain rule of probability:  p(x1, x,2, …, xn) = p(x1) p(x2| x1) p(x3| x2,x1) …
Language modelling
‹#›
I.A.1 Introduction
Language Models: probability distribution over a sequence of words p(x1, … xn)
			
P(Transformers, are, encoder, decoder, models) = 0.01

P(Transformers, are, are, encoder, decoder, models) = 0.0001  	Syntactic knowledge

P(Transformers, are, decoder, models) = 0.001 	Semantic knowledge








P(Transformers, are, encoder, decoder, models) = P(Transformers)
     . P(Transformers are | Transform

### I.5. Join extracted document by page

Explanation:

- The extraction above is not very informative because the document is split into many very small text blocks.
- Joining the extracted text by page gives a more meaningful output.
- A `page_number` metadata field is helpful for this.
- The other metadata fields need to be merged.

In [16]:
from collections import defaultdict
import os
from pptx import Presentation

# Assume Document is a class with `metadata` and `page_content` attributes.
class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

# Read the content of a PowerPoint file
def extract_pptx_content(file_path: str) -> list[Document]:
    presentation = Presentation(file_path)
    documents = []
    
    for i, slide in enumerate(presentation.slides):
        # Extract the text of each slide
        slide_content = "\n".join([shape.text for shape in slide.shapes if hasattr(shape, "text")])
        
        # Create one document per slide
        metadata = {
            'source': file_path,
            'category_depth': 1,
            'file_directory': os.path.dirname(file_path),
            'filename': os.path.basename(file_path),
            'last_modified': os.path.getmtime(file_path),  # Last-modified timestamp of the file
            'page_number': i + 1,  # Page number based on the slide index
            'languages': ['eng'],  # Default language
            'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
            'category': 'Title',  # Default category
            'element_id': f"{i:08d}",  # Unique element ID for each slide
        }
        
        documents.append(Document(slide_content, metadata))
    
    return documents

# Merge documents by page number
def merge_documents_by_page(documents: list[Document]) -> list[Document]:
    merged_documents: list[Document] = []
    page_dict = {}

    # Group the documents by page number
    for doc in documents:
        page_number = doc.metadata.get('page_number')
        if page_number is not None:
            if page_number not in page_dict:
                page_dict[page_number] = [doc]
            else:
                page_dict[page_number].append(doc)

    # Merge the documents of each page
    for page_number, docs in page_dict.items():
        if docs:
            # Use the metadata of the first document in the group
            merged_metadata = docs[0].metadata
            # Concatenate the content of all documents in the group
            merged_content = "\n".join([doc.page_content for doc in docs])
            # Repeat the merged content to obtain several repetitions
            merged_content = "\n".join([merged_content] * 5)  # Repeat the content 5 times
            # Create a new Document with the merged content and metadata
            merged_documents.append(Document(merged_content, merged_metadata))

    return merged_documents

# Load the documents from the `downloaded_files` directory
def load_documents_from_directory(directory: str) -> list[Document]:
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".pptx"):
            file_path = os.path.join(directory, filename)
            documents.extend(extract_pptx_content(file_path))  # Add the extracted documents
    return documents

# Specify the directory containing your files
directory_path = './downloaded_files'

# Load the documents from the directory
documents = load_documents_from_directory(directory_path)

# Merge the documents by page
merged_documents = merge_documents_by_page(documents)

# Print the merged documents in the desired format
for doc in merged_documents:
    print("-" * 50)
    print(f"Page Number: {doc.metadata.get('page_number')}")
    print(f"Content:\n{doc.page_content}")
    print(f"Metadata:\n{doc.metadata}")
    print("-" * 50)

--------------------------------------------------
Page Number: 1
Content:
Generative AI with LLM
Florian Bastin
👨🏼‍🎓 Master MASH - Université PSL
👨🏼‍💻 LLM Engineer @OctoTechnology
Le Monde, Casino, Channel, Club Med, Pernod Ricard, Suez
‹#›
Generative AI with LLM
Florian Bastin
👨🏼‍🎓 Master MASH - Université PSL
👨🏼‍💻 LLM Engineer @OctoTechnology
Le Monde, Casino, Channel, Club Med, Pernod Ricard, Suez
‹#›
Generative AI with LLM
Florian Bastin
👨🏼‍🎓 Master MASH - Université PSL
👨🏼‍💻 LLM Engineer @OctoTechnology
Le Monde, Casino, Channel, Club Med, Pernod Ricard, Suez
‹#›
Generative AI with LLM
Florian Bastin
👨🏼‍🎓 Master MASH - Université PSL
👨🏼‍💻 LLM Engineer @OctoTechnology
Le Monde, Casino, Channel, Club Med, Pernod Ricard, Suez
‹#›
Generative AI with LLM
Florian Bastin
👨🏼‍🎓 Master MASH - Université PSL
👨🏼‍💻 LLM Engineer @OctoTechnology
Le Monde, Casino, Ch
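
Note that `merge_documents_by_page` groups on `page_number` alone, so slide 1 of every presentation in the directory ends up in the same merged document. One possible variant, sketched below, groups on the pair (`source`, `page_number`) instead and leaves out the five-fold repetition. `Doc` is a minimal stand-in class and `merge_by_source_and_page` is a hypothetical name; the file names in the sample data are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_by_source_and_page(documents: list[Doc]) -> list[Doc]:
    """Merge documents that share both the same source file and page number."""
    groups: dict[tuple, list[Doc]] = defaultdict(list)
    for doc in documents:
        key = (doc.metadata.get("source"), doc.metadata.get("page_number"))
        groups[key].append(doc)

    merged = []
    for group in groups.values():
        content = "\n".join(d.page_content for d in group)
        # Copy the first document's metadata so the original dict is not shared.
        merged.append(Doc(content, dict(group[0].metadata)))
    return merged


sample_docs = [
    Doc("Title", {"source": "a.pptx", "page_number": 1}),       # hypothetical files
    Doc("Subtitle", {"source": "a.pptx", "page_number": 1}),
    Doc("Other deck", {"source": "b.pptx", "page_number": 1}),
]
for doc in merge_by_source_and_page(sample_docs):
    print(doc.metadata, repr(doc.page_content))
```

With this grouping, the two blocks from `a.pptx` are joined while the slide from `b.pptx` stays separate, even though all three share page number 1.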

# II. Ingesting in Cloud SQL

We will ingest each merged_document in Cloud SQL.

ALREADY DONE by teacher: 
- Create a Cloud SQL instance
- Create a database in the instance


TODO:
- Create a table in Cloud SQL named with your initials
- Create the schema of the table
- Ingest the data in the table


Follow this [documentation](https://python.langchain.com/docs/integrations/vectorstores/google_cloud_sql_pg/)

### II.1 Understand how to connect to Cloud SQL 


First we need to connect to Cloud SQL 
- Follow this [link](https://cloud.google.com/sql/docs/postgres/connect-instance-auth-proxy) to understand how it works

Then get familiar with the following PostgreSQL commands:
```bash
# Text after '#' is an explanation, not part of the command.
psql "host=127.0.0.1 port=5432 sslmode=disable dbname=gen_ai_db user=postgres"  # connect as the user `postgres`
# the user we use is `students`
# a password provided by the teacher is required
\l  # list all databases
\c gen_ai_db  # connect to the database `gen_ai_db`
\dt  # list all tables
\d+ table_name  # describe a table
SELECT * FROM table_name;  # select all rows from a table
\du  # list all users
\q  # quit
CREATE DATABASE db_name;  # create a database
CREATE USER user_name WITH PASSWORD 'password';  # create a user
GRANT ALL PRIVILEGES ON DATABASE db_name TO user_name;  # grant all privileges on a database to a user
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO user_name;  # grant all privileges on all tables in a schema to a user
ALTER USER user_name WITH SUPERUSER;  # grant superuser privileges to a user
DROP DATABASE db_name;  # drop a database
DROP USER user_name;  # drop a user
DROP TABLE table_name;  # drop a table
REVOKE ALL PRIVILEGES ON DATABASE db_name FROM user_name;  # revoke all privileges on a database from a user
```

Once the Cloud SQL Auth Proxy is downloaded and running as described in the tutorial, you should be connected to the instance.
You can then connect to the database as the user `students` with the password provided by the teacher:
  - `psql "host=127.0.0.1 port=5432 sslmode=disable dbname=gen_ai_db user=students"`
  - Enter the password provided by the teacher

Try to create a table `initial_tests_table` with the following schema:
  - `CREATE TABLE initial_tests_table (id SERIAL PRIMARY KEY, document TEXT, page_number INT, title TEXT, author TEXT, date TEXT);`
  - `\dt` to check that the table has been created
  - `\d+ initial_tests_table` to check the schema of the table
  - `DROP TABLE initial_tests_table;` to drop the table
  - `\q` to quit


In [17]:
%pip install --upgrade --quiet  langchain-google-cloud-sql-pg langchain-google-vertexai

Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv
load_dotenv(dotenv_path=".env.template")

True

In [2]:
import os
from config import PROJECT_ID, REGION, INSTANCE, DATABASE, DB_USER
DB_PASSWORD = os.environ["DB_PASSWORD"]

In [3]:
TABLE_NAME = "sk_table"  # Table name in the form <initials>_table, e.g. fb_table

In [4]:
from langchain_google_cloud_sql_pg import PostgresEngine

# Connect to the PostgreSQL database
engine = PostgresEngine.from_instance(
    project_id=PROJECT_ID,
    instance=INSTANCE,
    region=REGION,
    database=DATABASE,
    user=DB_USER,
    password=DB_PASSWORD,
)

In [5]:
# Create a table in the PostgreSQL database with the required columns
from sqlalchemy.exc import ProgrammingError

try:
    await engine.ainit_vectorstore_table(
        table_name=TABLE_NAME,
        vector_size=768,  # Vector size for the Vertex AI model (textembedding-gecko@latest)
    )
except ProgrammingError:
    print("Table already created")

Table already created


- Execute `\d+ [YOUR_INITIALS]_table` in the psql shell to check the schema of the table

### II.2 Create an embedding to convert your documents

In [7]:
from langchain_google_vertexai import VertexAIEmbeddings

embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest",  # Specify the embedding model name
    project="dauphine-437611" 
)
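
The table was created with `vector_size=768`, so the embedding model has to produce vectors of exactly that length. A quick sanity check along the following lines can catch a mismatch before ingestion. This is a sketch: `check_vector_size` is a hypothetical helper, and the vectors below are dummy lists rather than real embeddings. With the real model, one would presumably pass the result of `embedding.embed_query("some text")`.

```python
EXPECTED_VECTOR_SIZE = 768  # same value as the vector_size used when creating the table


def check_vector_size(vector: list[float], expected: int = EXPECTED_VECTOR_SIZE) -> None:
    """Raise ValueError if the embedding length differs from the table's vector size."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, but the table expects {expected}."
        )


# Dummy vectors standing in for real embeddings.
check_vector_size([0.0] * 768)  # correct size: no exception

try:
    check_vector_size([0.0] * 512)  # wrong size: raises ValueError
except ValueError as err:
    print(err)
```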

In [8]:
from langchain_google_cloud_sql_pg import PostgresVectorStore

vector_store = PostgresVectorStore.create_sync(  # Use .create() to initialize an async vector store
    engine=engine,
    table_name=TABLE_NAME,
    embedding_service=embedding,
)

In [10]:
#vector_store.add_documents(merged_documents)
# Execute this cell only once
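
Running the ingestion cell more than once would presumably store the same documents again under new identifiers. A possible safeguard, sketched here under assumptions, is to derive a reproducible ID for each document from its source and page number. `stable_document_id` is a hypothetical helper and the file path is invented. Passing such IDs through an `ids` argument of `add_documents`, and how the store reacts to an ID that already exists, are assumptions to check against the library documentation.

```python
import uuid


def stable_document_id(source: str, page_number: int) -> str:
    """Derive a reproducible UUID string from a document's source and page number."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#page={page_number}"))


first = stable_document_id("downloaded_files/deck.pptx", 1)  # hypothetical path
again = stable_document_id("downloaded_files/deck.pptx", 1)
other = stable_document_id("downloaded_files/deck.pptx", 2)

print(first == again)  # same inputs give the same ID
print(first == other)  # a different page gives a different ID
```

Because `uuid5` is a hash of its inputs, re-running the notebook yields the same IDs for the same slides, which makes accidental duplicates easier to detect.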

### II.3 Perform a similarity search

In [11]:
query = "How to train a Large Language Model?"

In [14]:
query = "How to train a Large Language Model?"

retriever = vector_store.as_retriever(
    search_type="similarity",  # Default search type for vector similarity
    search_kwargs={"k": 5}  # Retrieve the top 5 most similar documents
)

# Use .invoke() instead of .get_relevant_documents()
docs = retriever.invoke(query)


In [15]:
for doc in docs:
    print("-" * 50)
    print("Content: ", doc.page_content)
    print("Metadata: ", doc.metadata)

--------------------------------------------------
Content:  Training process
‹#›
I.A.7 Training Process
Steps 

Find scaling recipes (example: learning rate decrease if the size of the model increase)
Tune hyper parameters on small models of differents size
Choose the best models among the smallest ones
Train the biggest model with the 


Stanford CS229 I Machine Learning I Building Large Language Models (LLMs) [Youtube]
Q. Should I use Transformers or LSTM ? 
‹#›
II.A.3 RNN

Recurrent Neural Networks (Seq2seq model)
Embedding
Transformers
‹#›
II.B.1 Self Attention Mechanism
   are
encoder
decoder
2.11
-4.22
..
..
5.93
2.43
-3.2
..
..
3.32
2.11
-4.22
..
..
1.12
3.11
-4.22
..
..
4.98
Query		
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
WQ
E1
E2
E3
Q1
Q2
Q3
Q4
Are we talking about TV ? 
Do I mean Allocation de Retour à l’Emploi ?
Am I a superstar ?
…
Query: What am I looking for ? 
|E| : Embedding (1, 12 288)
|WQ|: Query matrix
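
To illustrate what a similarity search does conceptually, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top `k`. It is a simplified, in-memory illustration in pure Python: the three-dimensional vectors and their labels are invented, and whether the vector store uses cosine distance by default is an assumption not verified here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the labels of the k vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda label: cosine_similarity(query, vectors[label]),
        reverse=True,
    )
    return ranked[:k]


# Invented toy embeddings; real embeddings would have 768 dimensions.
toy_vectors = {
    "training": [0.9, 0.1, 0.0],
    "attention": [0.1, 0.9, 0.0],
    "rag": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

print(top_k(query_vector, toy_vectors, k=2))
```

In the notebook, the retriever plays the role of `top_k`: the query text is first embedded, then the `k` nearest stored vectors are returned together with their documents.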

**Congratulations**! You have successfully ingested the data in Cloud SQL.