## Introduction

In this tutorial, we’ll walk through the process of interacting with a Google Cloud Storage (GCS) bucket named dauphine-bucket, specifically focusing on the data directory within the bucket. We’ll cover how to:

- List all files in the bucket’s data directory.
- Retrieve information about a specific file.
- Read files using the Unstructured library.
- Visualize the extracted documents with LangChain.

This guide is intended for users who are familiar with Python and basic cloud storage concepts.

## Prerequisites

Before we begin, ensure you have the following:

- Python 3.x installed on your system.
- Access to the GCS bucket dauphine-bucket (data directory) with the necessary permissions.
- Google Cloud SDK installed and authenticated. You can authenticate by running:

In [2]:
!gcloud auth login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=FZObwIV5155UK6ywYxBpyfU6obFTTt&access_type=offline&code_challenge=caUCyCi7ohwJCMnbkMGcXAdhQYNCeI7nwYFGdrRKx6o&code_challenge_method=S256


You are now logged in as [mariem.inoubli888@gmail.com].
Your current project is [dauphine-437611].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


You also need the following Python libraries installed:
- google-cloud-storage
- unstructured
- langchain

To learn more about these libraries, see the following links:
- [google-cloud-storage](https://googleapis.dev/python/storage/latest/index.html)
- [unstructured](https://docs.unstructured.io/examplecode/codesamples/oss/vector-database)
- [langchain](https://langchain.readthedocs.io/en/latest/)


In [2]:
!pip install google-cloud-storage unstructured langchain langchain-unstructured python-magic python-dotenv sqlalchemy langchain_google_cloud_sql_pg
!pip install "unstructured[pptx]"



In [3]:
%pip install python-magic-bin

Note: you may need to restart the kernel to use updated packages.


# I. Listing and loading files from a GCS bucket

### I.1. Listing Files in the GCS Bucket

Explanation:

To interact with a GCS bucket, we’ll use the google-cloud-storage library. We’ll initialize a client, access the bucket, and list all the files within the data directory.

Code:

In [3]:
# Import the necessary library
from google.cloud import storage

# Initialize a client
client = storage.Client()

# Access the bucket
bucket_name = "dauphine-bucket"
bucket = client.get_bucket(bucket_name)

# List all files in the 'data' directory
blobs = bucket.list_blobs(prefix="data/")

print("Files in 'dauphine-bucket/data':")
for blob in blobs:
    print(blob.name)





Files in 'dauphine-bucket/data':
data/
data/1 - Gen AI - Dauphine Tunis.pptx
data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx
data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx
data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx


Output Explanation:

Running this code displays every object path under the data/ prefix of the bucket. The prefix="data/" parameter restricts the listing to that directory. Note that the listing also contains the data/ entry itself, not only the files inside it.
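
GCS has a flat namespace, so the prefix is matched against the start of each object name rather than against a real directory. The sketch below imitates that matching on a plain Python list. `filter_by_prefix` and the two extra object names are hypothetical and only serve the illustration; it suggests why the trailing slash in `data/` matters.

```python
# Object names as a flat list; the last two are hypothetical examples.
object_names = [
    "data/",
    "data/1 - Gen AI - Dauphine Tunis.pptx",
    "data_backup/old.pptx",
    "notes/readme.txt",
]


def filter_by_prefix(names: list[str], prefix: str) -> list[str]:
    """Keep the names that start with `prefix` (plain string matching)."""
    return [name for name in names if name.startswith(prefix)]


with_slash = filter_by_prefix(object_names, "data/")
without_slash = filter_by_prefix(object_names, "data")

print(with_slash)     # only objects under 'data/'
print(without_slash)  # also picks up 'data_backup/old.pptx'
```

Without the trailing slash, any object whose name merely begins with "data" would be listed as well.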

### I.2. Getting Information About One File


Explanation:

Sometimes, you may need detailed information about a specific file, such as its size, content type, or the last time it was updated. We’ll retrieve this metadata for a chosen file.


In [4]:
# Specify the file path (replace with an actual file from your bucket)
file_path = "data/1 - Gen AI - Dauphine Tunis.pptx"

# Get the blob object
blob = bucket.get_blob(file_path)

# Print the metadata if the blob exists
if blob:
    print(f"Information for '{file_path}':")
    print(f"Size: {blob.size} bytes")
    print(f"Content Type: {blob.content_type}")
    print(f"Updated On: '{blob.updated}'")
    print(f"Blob name: {blob.name}")
else:
    print(f"File '{file_path}' not found in the bucket.")

Information for 'data/1 - Gen AI - Dauphine Tunis.pptx':
Size: 6724048 bytes
Content Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Updated On: '2024-10-07 09:52:30.256000+00:00'
Blob name: data/1 - Gen AI - Dauphine Tunis.pptx


In [5]:
import os
DOWNLOADED_LOCAL_DIRECTORY = "./downloaded_files"


os.makedirs(DOWNLOADED_LOCAL_DIRECTORY, exist_ok=True)


Output Explanation:

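The cell that calls bucket.get_blob prints the file's size, content type, last update time, and name. Replace the value of file_path with the path of a file that exists in your bucket. The last cell creates the local directory where the files will be downloaded.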

### I.3. Reading Files with Unstructured

Explanation:

The Unstructured library allows us to parse and process unstructured data from various file formats. We’ll download a file from the bucket and use Unstructured to read and extract its content.

In [41]:
# Import necessary libraries
import os

from google.cloud import storage
from langchain_core.documents import Document
from langchain_unstructured import UnstructuredLoader
from pptx import Presentation


DOWNLOADED_LOCAL_DIRECTORY = "./downloaded_files"
file_path = "data/1 - Gen AI - Dauphine Tunis.pptx"

# Function to download the file: file_path from the GCS Bucket
def download_file_from_bucket(bucket: storage.Bucket, file_path: str) -> str:
    blob = bucket.blob(file_path)
    local_file_name = os.path.basename(file_path)
    local_filepath = os.path.join(DOWNLOADED_LOCAL_DIRECTORY, local_file_name)
    blob.download_to_filename(local_filepath)
    print(f"Downloaded '{file_path}' to '{local_filepath}'")
    return local_filepath

from datetime import datetime
import uuid  # Used to generate unique IDs

def extract_detailed_metadata(filepath: str) -> list[Document]:
    presentation = Presentation(filepath)
    documents = []

    # File-level metadata
    filename = os.path.basename(filepath)
    file_directory = os.path.dirname(filepath)
    last_modified = datetime.fromtimestamp(os.path.getmtime(filepath)).isoformat()
    filetype = "application/vnd.openxmlformats-officedocument.presentationml.presentation"

    # Extract the elements of each slide
    for slide_idx, slide in enumerate(presentation.slides):
        parent_id = None  # Reset for each new slide

        for shape_idx, shape in enumerate(slide.shapes):
            if shape.has_text_frame:  # Check whether the shape contains text
                for paragraph_idx, paragraph in enumerate(shape.text_frame.paragraphs):
                    text = paragraph.text.strip()
                    if text:  # Skip empty paragraphs
                        # Generate a unique ID for each element
                        element_id = str(uuid.uuid4())

                        # Determine the category and depth
                        if parent_id is None or shape_idx == 0:
                            category = "Title"
                            category_depth = 1
                            parent_id = element_id  
                        else:
                            category = "Content"
                            category_depth = 3

                        # Associated metadata
                        metadata = {
                            "source": filepath,
                            "category_depth": category_depth,
                            "file_directory": file_directory,
                            "filename": filename,
                            "last_modified": last_modified,
                            "page_number": slide_idx + 1,
                            "languages": ["eng"],
                            "filetype": filetype,
                            "category": category,
                            "element_id": element_id,
                        }

                        # Add the parent_id when relevant
                        if category == "Content":
                            metadata["parent_id"] = parent_id

                        # Create a `Document` for each paragraph
                        documents.append(Document(page_content=text, metadata=metadata))

    return documents







def read_file_from_local(local_filepath: str) -> list[Document]:
    if local_filepath.endswith(".pptx"):
        return extract_detailed_metadata(local_filepath)
    else:
        # Initialize UnstructuredLoader for other file types
        loader = UnstructuredLoader(file_path=local_filepath)
        return loader.load()


In [42]:
# Download and parse every file under the 'data/' prefix
blobs = list(bucket.list_blobs(prefix='data/'))
documents: list[Document] = []
if blobs:
    for blob in blobs:
        try:
            local_filepath = download_file_from_bucket(bucket, blob.name)
            documents.extend(read_file_from_local(local_filepath))
        except Exception as e:
            print(f"An error occurred while processing '{blob.name}': {e}")
else:
    print("No files found in the 'data' directory.")

An error occurred while processing 'data/': [Errno 2] No such file or directory: './downloaded_files\\'
Downloaded 'data/1 - Gen AI - Dauphine Tunis.pptx' to './downloaded_files\1 - Gen AI - Dauphine Tunis.pptx'
Downloaded 'data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx' to './downloaded_files\2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx'
Downloaded 'data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx' to './downloaded_files\2.2  - Transformers - Gen AI - Dauphine Tunis.pptx'
Downloaded 'data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx' to './downloaded_files\3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx'
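
The first line of the output above reports an error for 'data/'. This is most likely the placeholder object that represents the folder itself: its base name is empty, so the download target resolves to the directory rather than to a file. Here is a minimal sketch of one way to skip such entries, assuming placeholder names always end with `/`. `is_directory_placeholder` is a hypothetical helper, not part of google-cloud-storage.

```python
def is_directory_placeholder(blob_name: str) -> bool:
    """Return True for folder placeholder objects such as 'data/'."""
    return blob_name.endswith("/")


# Names copied from the bucket listing shown earlier.
blob_names = [
    "data/",
    "data/1 - Gen AI - Dauphine Tunis.pptx",
]

# Keep only the entries that correspond to real files.
file_names = [name for name in blob_names if not is_directory_placeholder(name)]
print(file_names)
```

In the loading loop, the same check could be applied to `blob.name` before calling `download_file_from_bucket`.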


### I.4. Visualizing the First Documents Extracted with LangChain

Explanation:

LangChain is a framework for developing applications powered by language models. We’ll use it to load and visualize the documents extracted from the file.

In [43]:
for doc in documents[:3]:
    print(f"Content:\n{doc.page_content}\nMetadata:\n{doc.metadata}\n")


Content:
Generative AI with LLM
Metadata:
{'source': './downloaded_files\\1 - Gen AI - Dauphine Tunis.pptx', 'category_depth': 1, 'file_directory': './downloaded_files', 'filename': '1 - Gen AI - Dauphine Tunis.pptx', 'last_modified': '2024-12-09T20:29:53.760732', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': '00cdf085-1e67-4fe4-9239-323f11e984e1'}

Content:
Florian Bastin
Metadata:
{'source': './downloaded_files\\1 - Gen AI - Dauphine Tunis.pptx', 'category_depth': 1, 'file_directory': './downloaded_files', 'filename': '1 - Gen AI - Dauphine Tunis.pptx', 'last_modified': '2024-12-09T20:29:53.760732', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': '8d684e44-003e-4c3e-acdd-811a7e4c178b'}

Content:
👨🏼‍🎓 Master MASH - Université PSL
Metadata:
{'source': './d

### I.5. Joining Extracted Documents by Page

Explanation:

- The individual extracted elements are not very informative because each one holds only a small block of text.
- Joining the extracted text by page gives a more meaningful unit.
- The 'page_number' metadata tells us which elements belong together.
- The other metadata fields need to be merged; here we keep those of the first element on each page.

In [44]:
from langchain_core.documents import Document

# Function to merge documents by page number
def merge_documents_by_page(documents: list[Document]) -> list[Document]:
    merged_documents: list[Document] = []
    page_dict = {}

    # Group documents by page number
    for doc in documents:
        page_number = doc.metadata.get('page_number')
        if page_number is not None:
            if page_number not in page_dict:
                page_dict[page_number] = [doc]
            else:
                page_dict[page_number].append(doc)

    # Merge documents for each page
    for page_number, docs in page_dict.items():
        if docs:
            # Use the metadata of the first document in the group
            merged_metadata = docs[0].metadata
            # Concatenate the page content of all documents in the group
            merged_content = "\n".join([doc.page_content for doc in docs])
            
            # Create a new Document with merged content and metadata
            merged_documents.append(Document(
                page_content=merged_content,
                metadata=merged_metadata
            ))

    return merged_documents

# Example: Assume documents are already extracted
merged_documents = merge_documents_by_page(documents)
print(merged_documents)
# Print the merged documents
for doc in merged_documents:
    print("-" * 50)
    print(f"Page Number: {doc.metadata.get('page_number')}")
    print(f"Content:\n{doc.page_content}\n")
    print(f"Metadata:\n{doc.metadata}\n")
    print("-" * 50)


[Document(metadata={'source': './downloaded_files\\1 - Gen AI - Dauphine Tunis.pptx', 'category_depth': 1, 'file_directory': './downloaded_files', 'filename': '1 - Gen AI - Dauphine Tunis.pptx', 'last_modified': '2024-12-09T20:29:53.760732', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title', 'element_id': '00cdf085-1e67-4fe4-9239-323f11e984e1'}, page_content='Generative AI with LLM\nFlorian Bastin\n👨🏼\u200d🎓 Master MASH - Université PSL\n👨🏼\u200d💻 LLM Engineer @OctoTechnology\nLe Monde, Casino, Channel, Club Med, Pernod Ricard, Suez\n‹#›\nGenerative AI with LLM\nFlorian Bastin\n👨🏼\u200d🎓 Master MASH - Université PSL\n👨🏼\u200d💻 LLM Engineer @OctoTechnology\nLe Monde, Casino, Channel, Club Med, Pernod Ricard, Suez\n‹#›\nGenerative AI with LLM\nFlorian Bastin\n👨🏼\u200d🎓 Master MASH - Université PSL\n👨🏼\u200d💻 LLM Engineer @OctoTechnology\nLe Monde, Casino, Channel, Club Med, Pernod Ricard, S
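
`merge_documents_by_page` groups on 'page_number' alone, so slide 1 of every presentation ends up in the same merged document. If that is not what you want, one option is to group on the pair (source, page_number). The sketch below assumes that approach and uses a small stand-in `Doc` dataclass (hypothetical, with the same two fields as LangChain's `Document`) so that it runs without any dependency.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_by_source_and_page(documents: list[Doc]) -> list[Doc]:
    """Merge elements that share both the same source file and page number."""
    groups: dict[tuple, list[Doc]] = defaultdict(list)
    for doc in documents:
        page_number = doc.metadata.get("page_number")
        if page_number is not None:
            groups[(doc.metadata.get("source"), page_number)].append(doc)

    merged = []
    for docs in groups.values():
        merged.append(Doc(
            page_content="\n".join(d.page_content for d in docs),
            metadata=dict(docs[0].metadata),  # copy the first element's metadata
        ))
    return merged


# Made-up sample: two files that both have a page 1.
sample = [
    Doc("Title A", {"source": "a.pptx", "page_number": 1}),
    Doc("Body A", {"source": "a.pptx", "page_number": 1}),
    Doc("Title B", {"source": "b.pptx", "page_number": 1}),
]
merged = merge_by_source_and_page(sample)
for doc in merged:
    print(doc.metadata["source"], repr(doc.page_content))
```

With this grouping, the two sample files stay separate even though both contain a page 1.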

# II. Ingesting in Cloud SQL

We will ingest each merged document into Cloud SQL.

ALREADY DONE by teacher: 
- Create a Cloud SQL instance
- Create a database in the instance


TODO:
- Create a table in Cloud SQL named with your initials
- Create the schema of the table
- Ingest the data in the table


Follow this [documentation](https://python.langchain.com/docs/integrations/vectorstores/google_cloud_sql_pg/)

### II.1 Understand how to connect to Cloud SQL 


First we need to connect to Cloud SQL 
- Follow this [link](https://cloud.google.com/sql/docs/postgres/connect-instance-auth-proxy) to understand how it works

Then familiarize yourself with the following PostgreSQL commands:
```bash
psql "host=127.0.0.1 port=5432 sslmode=disable dbname=gen_ai_db user=postgres"  # connect as the user `postgres`
# the user we use is `students`
# a password provided by the teacher is required
\l  # list all databases
\c gen_ai_db  # connect to the database `gen_ai_db`
\dt  # list all tables
\d+ table_name  # describe a table
SELECT * FROM table_name;  # select all rows from a table
\du  # list all users
\q  # quit
CREATE DATABASE db_name;  # create a database
CREATE USER user_name WITH PASSWORD 'password';  # create a user
GRANT ALL PRIVILEGES ON DATABASE db_name TO user_name;  # grant all privileges on a database to a user
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO user_name;  # grant all privileges on all tables in a schema to a user
ALTER USER user_name WITH SUPERUSER;  # grant superuser privileges to a user
DROP DATABASE db_name;  # drop a database
DROP USER user_name;  # drop a user
DROP TABLE table_name;  # drop a table
REVOKE ALL PRIVILEGES ON DATABASE db_name FROM user_name;  # revoke all privileges on a database from a user
```

Once you have downloaded the Cloud SQL Auth Proxy and followed the tutorial linked above, you should be connected to the instance.
You can then connect to the database as the user `students` with the password provided by the teacher:
  - `psql "host=127.0.0.1 port=5432 sslmode=disable dbname=gen_ai_db user=students"`
  - Enter the password provided by the teacher
Try to create a table `initial_tests_table` with the following schema:
  - `CREATE TABLE initial_tests_table (id SERIAL PRIMARY KEY, document TEXT, page_number INT, title TEXT, author TEXT, date TEXT);`
  - `\dt` to check if the table has been created
  - `\d+ initial_tests_table` to check the schema of the table
  - `DROP TABLE initial_tests_table;` to drop the table
  - `\q` to quit


In [40]:
%pip install --upgrade --quiet  langchain-google-cloud-sql-pg langchain-google-vertexai

Note: you may need to restart the kernel to use updated packages.


In [45]:
from dotenv import load_dotenv
load_dotenv()

True

In [46]:
import os
from config import PROJECT_ID, REGION, INSTANCE, DATABASE, DB_USER
DB_PASSWORD = os.environ["DB_PASSWORD"]

In [48]:
TABLE_NAME = "meriam_in_table" 

In [49]:
from langchain_google_cloud_sql_pg import PostgresEngine

# Connect to the PostgreSQL database
engine = PostgresEngine.from_instance(
    project_id=PROJECT_ID,
    instance=INSTANCE,
    region=REGION,
    database=DATABASE,
    user=DB_USER,
    password=DB_PASSWORD,
)


In [51]:
# Create a table in the PostgreSQL database with the required columns
from sqlalchemy.exc import ProgrammingError

try:
    await engine.ainit_vectorstore_table(
        table_name=TABLE_NAME, 
        vector_size=768,
    )
except ProgrammingError:
    print("Table already created")

Table already created


- Execute `\d+ [YOUR_INITIALS]_table` in the psql shell to check the schema of the table.

### II.2 Create an embedding model to convert your documents

In [54]:
from langchain_google_vertexai import VertexAIEmbeddings

embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest",
    project=PROJECT_ID
)

In [55]:
from langchain_google_cloud_sql_pg import PostgresVectorStore

vector_store = PostgresVectorStore.create_sync(  
    engine=engine,
    table_name=TABLE_NAME,
    embedding_service=embedding,
)

In [None]:
# Execute this cell only once; running it again would insert the documents a second time
# vector_store.add_documents(merged_documents)

['0ac93dc4-0395-4355-b85c-ee9710807bb3',
 'cd212bcc-61c2-41b9-abfe-89ade65e704c',
 '77f7706a-9a93-4248-89a0-74d8db02baad',
 '6e6f012e-72f9-4412-bfbe-f883f0980b06',
 '4cf2835b-db45-4723-9f47-0b5a4f48c537',
 'b833b0e8-df45-47ec-82c6-ced77b53739d',
 '2e1e03ce-3828-4977-901d-8ac1092aafaa',
 'c41fc8d3-ed27-42a0-b1d7-cfc199f1a2d3',
 '2c181663-258c-4489-93f0-ee1064dc8791',
 '261a1f35-524d-4e59-a8ab-6809cef17201',
 '6eaaa5f0-3e15-4905-8a19-4c662efd1ffc',
 '90db450a-6df1-4d65-8956-d0c292b8955f',
 'fbe3c6ab-4c0b-496a-9e6d-1ffe678b790a',
 'f87a5753-941d-4738-a13f-b0801122ed10',
 '33210c6c-fd9b-4ead-96f3-5395303785a1',
 'd0b9300a-1b49-436b-ac91-cc369ed6c68a',
 '8219242b-30c5-41e5-922e-272d84419c43',
 'dc8ec39b-6594-4811-958a-9292ba60e06e',
 '0619635c-d762-4c8b-b160-7431fa7e525d',
 'dd8478d7-08bb-484f-aeba-20271364725e',
 '4ae04af7-c2c3-4e22-bf5b-bd19316259b8',
 '0e3286ac-2459-4bac-bfca-c950f7f44616',
 '06ec101d-a56c-461c-ae0a-07bfbf4157a5',
 '43231224-1e38-4f45-8d2e-dddf819c3d26',
 'e6d6c45d-46e5-

### II.3 Perform a similarity search

In [57]:
query = "How to train a Large Language Model?"

In [59]:
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)

docs = retriever.invoke(query)
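
To build some intuition for what a score threshold does, here is a simplified, library-independent sketch. It assumes cosine similarity as the score and uses made-up 2-dimensional vectors. The actual scoring performed by `PostgresVectorStore` may differ, and `cosine_similarity` and `filter_by_threshold` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_by_threshold(query_vec, doc_vecs, threshold=0.5):
    """Return (name, score) pairs with score >= threshold, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
doc_vecs = {
    "close": [0.9, 0.1],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}

results = filter_by_threshold(query_vec, doc_vecs)
print(results)  # only the 'close' vector passes the 0.5 threshold
```

Raising the threshold returns fewer but closer matches; lowering it returns more documents, including loosely related ones.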

In [60]:
for doc in docs:
    print("-" * 50)
    print("Content: ", doc.page_content)
    print("Metadata: ", doc.metadata)

print(docs)

--------------------------------------------------
Content:  Training process
‹#›
I.A.7 Training Process
Steps
Find scaling recipes (example: learning rate decrease if the size of the model increase)
Tune hyper parameters on small models of differents size
Choose the best models among the smallest ones
Train the biggest model with the
Stanford CS229 I Machine Learning I Building Large Language Models (LLMs) [Youtube]
Q. Should I use Transformers or LSTM ?
‹#›
II.A.3 RNN
Recurrent Neural Networks (Seq2seq model)
Embedding
Transformers
‹#›
II.B.1 Self Attention Mechanism
are
encoder
decoder
2.11
-4.22
..
..
5.93
2.43
-3.2
..
..
3.32
2.11
-4.22
..
..
1.12
3.11
-4.22
..
..
4.98
Query
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
WQ
E1
E2
E3
Q1
Q2
Q3
Q4
Are we talking about TV ?
Do I mean Allocation de Retour à l’Emploi ?
Am I a superstar ?
…
Query: What am I looking for ?
|E| : Embedding (1, 12 288)
|WQ|: Query matrix (12 288, 128)
III.2. Informati

**Congratulations**! You have successfully ingested the data into Cloud SQL.