## Introduction

In this tutorial, we’ll walk through the process of interacting with a Google Cloud Storage (GCS) bucket named dauphine-bucket, specifically focusing on the data directory within the bucket. We’ll cover how to:

- List all files in the bucket’s data directory.
- Retrieve information about a specific file.
- Read files using the Unstructured library.
- Visualize the extracted documents with LangChain.

This guide is intended for users who are familiar with Python and basic cloud storage concepts.

## Prerequisites

Before we begin, ensure you have the following:

- Python 3.x installed on your system.
- Access to the GCS bucket dauphine-bucket/data with the necessary permissions.
- Google Cloud SDK installed and authenticated. You can authenticate by running:

In [1]:
!gcloud auth login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=8HUXNz6oRlUK4rKBZJ09OHOhvoHMX1&access_type=offline&code_challenge=dUs-iUh03KRRVSbyghhzU7hwlCxYcYAN-3dvVOrKxyc&code_challenge_method=S256


You are now logged in as [linathabet101@gmail.com].
Your current project is [dauphine-437611].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


You also need the following Python libraries installed:
- google-cloud-storage
- unstructured
- langchain

To learn more about these libraries, see the following links:
- [google-cloud-storage](https://googleapis.dev/python/storage/latest/index.html)
- [unstructured](https://docs.unstructured.io/examplecode/codesamples/oss/vector-database)
- [langchain](https://langchain.readthedocs.io/en/latest/)


In [1]:
%pip install -q google-cloud-storage unstructured langchain python-magic sqlalchemy langchain_google_cloud_sql_pg
%pip install -q "unstructured[pptx]"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


# I. Listing and loading files from a GCS bucket

### I.1. Listing Files in the GCS Bucket

Explanation:

To interact with a GCS bucket, we’ll use the google-cloud-storage library. We’ll initialize a client, access the bucket, and list all the files within the data directory.

Code:

In [2]:
# Import the necessary library
from google.cloud import storage

# Initialize a client
client = storage.Client()

# Access the bucket
bucket_name = 'dauphine-bucket'
bucket = client.get_bucket(bucket_name)

# List all files in the 'data' directory
blobs = bucket.list_blobs(prefix='data/')  # List blobs with 'data/' prefix

print("Files in 'dauphine-bucket/data':")
for blob in blobs:
    print(blob.name)


Files in 'dauphine-bucket/data':
data/
data/1 - Gen AI - Dauphine Tunis.pptx
data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx
data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx
data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx


Output Explanation:

Running this code displays the names of all objects under the data directory of the bucket. The prefix='data/' parameter restricts the listing to names that start with data/, which is why the data/ entry itself also appears in the output.
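GCS has a flat namespace, so the prefix is a plain string match on object names rather than a directory lookup. The sketch below mimics that matching in pure Python. It is an illustration, not the client library's implementation, and the names `database.csv` and `data_old/notes.txt` are invented for the example.

```python
# Hypothetical object names; only the first two mirror the listing above.
object_names = [
    "data/",
    "data/1 - Gen AI - Dauphine Tunis.pptx",
    "database.csv",
    "data_old/notes.txt",
]


def match_prefix(names: list[str], prefix: str) -> list[str]:
    """Return the names that start with `prefix`, as a prefix filter would."""
    return [name for name in names if name.startswith(prefix)]


with_slash = match_prefix(object_names, "data/")
without_slash = match_prefix(object_names, "data")

print(with_slash)     # only names under "data/"
print(without_slash)  # also matches "database.csv" and "data_old/notes.txt"
```

This is why the trailing slash in the prefix matters.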

### I.2. Getting Information About One File


Explanation:

Sometimes, you may need detailed information about a specific file, such as its size, content type, or the last time it was updated. We’ll retrieve this metadata for a chosen file.


In [3]:
# Specify the file path (replace with an actual file from your bucket)
file_path = 'data/1 - Gen AI - Dauphine Tunis.pptx'

# Get the blob object
blob = bucket.get_blob(file_path)

if blob:
    print(f"Information for '{file_path}':")
    print(f"Size: {blob.size} bytes")
    print(f"Content Type: {blob.content_type}")
    print(f"Updated On: {blob.updated}")  # Retrieve the last updated timestamp
    print(f"Blob Name: {blob.name}")
else:
    print(f"File '{file_path}' not found in the bucket.")

Information for 'data/1 - Gen AI - Dauphine Tunis.pptx':
Size: 6724048 bytes
Content Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Updated On: 2024-10-07 09:52:30.256000+00:00
Blob Name: data/1 - Gen AI - Dauphine Tunis.pptx


Output Explanation:

This code outputs metadata about the specified file. Make sure to set file_path to the path of a file that exists in your bucket.
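The size is reported in raw bytes, which is hard to read for large files. As a small convenience, the sketch below converts a byte count to binary units; `format_size` is a hypothetical helper, not part of google-cloud-storage.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["bytes", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    index = 0
    # Divide by 1024 until the value fits the current unit
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    if index == 0:
        return f"{num_bytes} bytes"
    return f"{size:.2f} {units[index]}"


# Size reported for the file above
print(format_size(6724048))
```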

### I.3. Reading Files with Unstructured

Explanation:

The Unstructured library allows us to parse and process unstructured data from various file formats. We’ll download a file from the bucket and use Unstructured to read and extract its content.

In [4]:
# Import necessary libraries
from langchain.document_loaders import UnstructuredFileLoader
from langchain_core.documents import Document
import os
from google.cloud.storage.bucket import Bucket

DOWNLOADED_LOCAL_DIRECTORY = "./downloaded_files"
os.makedirs(DOWNLOADED_LOCAL_DIRECTORY, exist_ok=True)


# Download the file at file_path from the GCS bucket
def download_file_from_bucket(bucket: Bucket, file_path: str) -> str:
    """
    Downloads a file from a GCS Bucket to the local filesystem.

    Args:
        bucket (Bucket): The GCS bucket object.
        file_path (str): The path of the file within the bucket.

    Returns:
        str: The local file path.
    """
    # Get a reference to the blob
    blob = bucket.blob(file_path)

    # Extract the local file name
    local_file_name = os.path.basename(file_path)

    # Define the local file path
    local_filepath = os.path.join(DOWNLOADED_LOCAL_DIRECTORY, local_file_name)

    # Perform the download
    blob.download_to_filename(local_filepath)
    print(f"Downloaded '{file_path}' to '{local_file_name}'")
    return local_filepath


def read_file_from_local(local_filepath: str) -> list[Document]:
    """
    Reads a file from the local filesystem using UnstructuredFileLoader
    and returns a list of Document objects.

    Args:
        local_filepath (str): The path to the local file to be read.

    Returns:
        list[Document]: A list of Document objects loaded from the specified file.
    """
    # Initialize the loader with the file path
    loader = UnstructuredFileLoader(local_filepath)

    # Load the documents
    documents = loader.load()

    return documents


In [5]:
# Load all the files from the 'data/' directory
blobs = list(bucket.list_blobs(prefix='data/'))
documents: list[Document] = []
if blobs:
    for blob in blobs:
        try:
            local_filepath = download_file_from_bucket(bucket, blob.name)
            documents.extend(read_file_from_local(local_filepath))
        except Exception as e:
            print(f"An error occurred while processing '{blob.name}': {e}")
else:
    print("No files found in the 'data' directory.")

An error occurred while processing 'data/': [Errno 2] No such file or directory: './downloaded_files\\'
Downloaded 'data/1 - Gen AI - Dauphine Tunis.pptx' to '1 - Gen AI - Dauphine Tunis.pptx'


An error occurred while processing 'data/1 - Gen AI - Dauphine Tunis.pptx': failed to find libmagic.  Check your installation
Downloaded 'data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx' to '2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx'
An error occurred while processing 'data/2.1 - Before Transformers - Gen AI - Dauphine Tunis.pptx': failed to find libmagic.  Check your installation
Downloaded 'data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx' to '2.2  - Transformers - Gen AI - Dauphine Tunis.pptx'
An error occurred while processing 'data/2.2  - Transformers - Gen AI - Dauphine Tunis.pptx': failed to find libmagic.  Check your installation
Downloaded 'data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx' to '3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx'
An error occurred while processing 'data/3 - Retrieval Augmented Generation - Gen AI - Dauphine Tunis.pptx': failed to find libmagic.  Check your installation
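
The first error in the output above comes from the `data/` entry: its name ends with a slash, so its base name is empty and the download target becomes the local directory itself. The sketch below reproduces that path logic and shows a hypothetical `is_placeholder` helper that could be used to skip such entries. It uses `posixpath` because object names always use `/`, whatever the local operating system.

```python
import posixpath


def is_placeholder(blob_name: str) -> bool:
    """Hypothetical helper: True for folder-like names that end with '/'."""
    return blob_name.endswith("/")


names = ["data/", "data/1 - Gen AI - Dauphine Tunis.pptx"]

for name in names:
    base = posixpath.basename(name)  # empty string for "data/"
    print(f"{name!r} -> base name {base!r}, skip: {is_placeholder(name)}")

# Keep only the entries that correspond to real files
files_only = [name for name in names if not is_placeholder(name)]
print(files_only)
```

The remaining errors are unrelated to the bucket: they report that the `libmagic` system library, which `python-magic` depends on, could not be found on the machine.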


In [6]:
from langchain_unstructured import UnstructuredLoader

local_filepath = "downloaded_files/1 - Gen AI - Dauphine Tunis.pptx"

loader = UnstructuredLoader(local_filepath)
documents = loader.load()

for doc in documents:
    print(doc.page_content)


ImportError: failed to find libmagic.  Check your installation

In [7]:
# Print a summary of the documents
for i, doc in enumerate(documents):
    print(f"--- Document {i+1} ---")
    print("Content:", doc.page_content[:300])  # Display the first 300 characters
    print("Metadata:", doc.metadata)
    print("\n")


### I.4. Visualizing the First Documents Extracted with LangChain

Explanation:

LangChain is a framework for developing applications powered by language models. Its loaders return Document objects, each with a page_content string and a metadata dictionary. Here we print the first three documents to inspect what was extracted.

In [8]:
for doc in documents[:3]:
    print(f"Content:\n{doc.page_content}\nMetadata:\n{doc.metadata}\n")

### I.5. Joining Extracted Documents by Page

Explanation:

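- The individual extracted elements are not very informative, because the document is extracted as many very small text blocks.
- Joining the extracted text by page gives a more meaningful output.
- The 'page_number' metadata field identifies which elements belong to the same page.
- The other metadata fields would need to be merged; the function below simply keeps the metadata of the first element of each page.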

In [9]:
from collections import defaultdict
from langchain_core.documents import Document

# Function to merge documents by page number
def merge_documents_by_page(documents: list[Document]) -> list[Document]:
    merged_documents: list[Document] = []
    page_dict = defaultdict(list)

    # Group documents by page number
    for doc in documents:
        page_number = doc.metadata.get('page_number')
        if page_number is not None:
            page_dict[page_number].append(doc)

    # Merge documents for each page
    for page_number, docs in page_dict.items():
        if docs:
            # Use the metadata of the first document in the group
            merged_metadata = docs[0].metadata

            # Concatenate the content of all documents in the group
            merged_content = "\n".join(doc.page_content for doc in docs)

            # Create a new Document with merged content and metadata
            merged_documents.append(Document(page_content=merged_content, metadata=merged_metadata))

    return merged_documents

# Merge the documents by page
merged_documents = merge_documents_by_page(documents)

# Print the merged documents
for doc in merged_documents:
    print("-" * 50)
    print(f"Page Number: {doc.metadata.get('page_number')}")
    print(f"Content:\n{doc.page_content}\nMetadata:\n{doc.metadata}\n")
    print("-" * 50)
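
The function above keeps only the metadata of the first element of each page. One possible way to merge the metadata of all elements instead is sketched below on plain dictionaries. The `merge_metadata` helper and the sample values are hypothetical, and the merge rule (keep a value when all elements agree, otherwise collect the distinct values in a list) is just one reasonable choice.

```python
from typing import Any


def merge_metadata(metadatas: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge metadata dicts: keep a single value when all elements agree,
    otherwise collect the distinct values in a list (first-seen order)."""
    # Collect every key while preserving first-seen order
    keys: list[str] = []
    for metadata in metadatas:
        for key in metadata:
            if key not in keys:
                keys.append(key)

    merged: dict[str, Any] = {}
    for key in keys:
        values: list[Any] = []
        for metadata in metadatas:
            if key in metadata and metadata[key] not in values:
                values.append(metadata[key])
        merged[key] = values[0] if len(values) == 1 else values
    return merged


# Hypothetical metadata for two elements extracted from the same slide
page_metadatas = [
    {"page_number": 1, "category": "Title", "filename": "deck.pptx"},
    {"page_number": 1, "category": "NarrativeText", "filename": "deck.pptx"},
]

merged = merge_metadata(page_metadatas)
print(merged)
```

Inside `merge_documents_by_page`, such a helper could be called on `[doc.metadata for doc in docs]` in place of `docs[0].metadata`.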


# II. Ingesting in Cloud SQL

We will ingest each document in merged_documents into Cloud SQL.

ALREADY DONE by teacher: 
- Create a Cloud SQL instance
- Create a database in the instance


TODO:
- Create a table in Cloud SQL named with your initials
- Create the schema of the table
- Ingest the data in the table


Follow this [documentation](https://python.langchain.com/docs/integrations/vectorstores/google_cloud_sql_pg/)

### II.1 Understand how to connect to Cloud SQL 


First we need to connect to Cloud SQL 
- Follow this [link](https://cloud.google.com/sql/docs/postgres/connect-instance-auth-proxy) to understand how it works

Then get familiar with the following PostgreSQL commands:
```bash
psql "host=127.0.0.1 port=5432 sslmode=disable dbname=gen_ai_db user=postgres" # to connect as the user `postgres`
# the user we use is `students`
# a password provided by the teacher is required
\l # to list all databases
\c gen_ai_db # to connect to the database `gen_ai_db`
\dt # to list all tables
\d+ table_name # to describe a table
SELECT * FROM table_name; # to select all rows from a table
\du # to list all users
\q # to quit
CREATE DATABASE db_name; # to create a database
CREATE USER user_name WITH PASSWORD 'password'; # to create a user
GRANT ALL PRIVILEGES ON DATABASE db_name TO user_name; # to grant all privileges to a user on a database
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO user_name; # to grant all privileges to a user on all tables in a schema
ALTER USER user_name WITH SUPERUSER; # to grant superuser privileges to a user
DROP DATABASE db_name; # to drop a database
DROP USER user_name; # to drop a user
DROP TABLE table_name; # to drop a table
REVOKE ALL PRIVILEGES ON DATABASE db_name FROM user_name; # to revoke all privileges from a user on a database
```

Once you have downloaded the Cloud SQL Auth Proxy and followed the tutorial, you should be connected to the instance.

You can connect to the database as the user `students` with the password provided by the teacher:
  - `psql "host=127.0.0.1 port=5432 sslmode=disable dbname=gen_ai_db user=students"`
  - Enter the password provided by the teacher

Try to create a table `initial_tests_table` with the following schema:
  - `CREATE TABLE initial_tests_table (id SERIAL PRIMARY KEY, document TEXT, page_number INT, title TEXT, author TEXT, date TEXT);`
  - `\dt` to check if the table has been created
  - `\d+ initial_tests_table` to check the schema of the table
  - `DROP TABLE initial_tests_table;` to drop the table
  - `\q` to quit


In [10]:
%pip install --upgrade --quiet  langchain-google-cloud-sql-pg langchain-google-vertexai

Note: you may need to restart the kernel to use updated packages.


In [12]:
from dotenv import load_dotenv
load_dotenv()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
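
A first byte of `0xff` is typical of a file saved as UTF-16 with a byte order mark, which some Windows editors and shells produce by default. This is a likely cause, not a confirmed diagnosis. The sketch below uses only the standard library to inspect the first bytes of a file and rewrite it as UTF-8 when a UTF-16 mark is found. The helper names are hypothetical, and the demonstration runs on a temporary file rather than on a real `.env`.

```python
import codecs
import tempfile
from pathlib import Path
from typing import Optional


def detect_utf16_bom(path: Path) -> Optional[str]:
    """Return 'utf-16' if the file starts with a UTF-16 byte order mark, else None."""
    head = path.read_bytes()[:2]
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    return None


def convert_to_utf8(path: Path) -> bool:
    """Rewrite the file as UTF-8 if it carries a UTF-16 mark. Returns True if converted."""
    encoding = detect_utf16_bom(path)
    if encoding is None:
        return False
    text = path.read_text(encoding=encoding)  # the 'utf-16' codec consumes the mark
    path.write_text(text, encoding="utf-8")
    return True


# Demonstration on a temporary file with placeholder content
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_bytes(codecs.BOM_UTF16_LE + "DB_PASSWORD=example\n".encode("utf-16-le"))
    converted = convert_to_utf8(env_file)
    content = env_file.read_text(encoding="utf-8")

print(converted, repr(content))
```

With a real file, `convert_to_utf8(Path(".env"))` would be called once before `load_dotenv()`.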

In [31]:
import os
#from config import PROJECT_ID, REGION, INSTANCE, DATABASE, DB_USER
DB_PASSWORD = os.environ["DB_PASSWORD"]
print(DB_PASSWORD)

KeyError: 'DB_PASSWORD'
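
This `KeyError` most likely follows from the failed `load_dotenv()` call: the variable was never loaded into the environment. A minimal sketch of a hypothetical `require_env` helper that fails with a clearer message, together with a `mask` helper that avoids printing the secret itself:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check that your .env file was loaded."
        )
    return value


def mask(secret: str) -> str:
    """Hide a secret, showing only its length."""
    return "*" * len(secret)


# Demonstration with a placeholder variable (hypothetical name and value)
os.environ["DEMO_DB_PASSWORD"] = "example"
demo_password = require_env("DEMO_DB_PASSWORD")
print(mask(demo_password))
```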

In [29]:
TABLE_NAME = "fb_table"  # Table name in the database, in the form <initials>_table. Ex: fb_table

In [30]:
from langchain_google_cloud_sql_pg import PostgresEngine

# Connect to the PostgreSQL database
engine = PostgresEngine.from_instance(
    project_id=PROJECT_ID,
    instance=INSTANCE,
    region=REGION,
    database=DATABASE,
    user=DB_USER,
    password=DB_PASSWORD,
)


NameError: name 'DB_PASSWORD' is not defined

In [63]:
# Create a table in the PostgreSQL database with the required columns
from sqlalchemy.exc import ProgrammingError

try:
    await engine.ainit_vectorstore_table(
        table_name=TABLE_NAME,
        vector_size=768,  # Vector size for VertexAI model (textembedding-gecko@latest)
    )
except ProgrammingError:
    print("Table already created")

Table already created


- Execute `\d+ [YOUR_INITIALS]_table` in the psql shell to check the schema of the table

### II.2 Create an embedding to convert your documents

In [21]:
from langchain_google_vertexai import VertexAIEmbeddings

embedding = VertexAIEmbeddings(
    model_name=#TODO,
    project=PROJECT_ID
)

In [65]:
from langchain_google_cloud_sql_pg import PostgresVectorStore

vector_store = PostgresVectorStore.create_sync(  # Use .create() to initialize an async vector store
    engine=engine,
    table_name=TABLE_NAME,
    embedding_service=embedding,
)

In [66]:
# vector_store.add_documents(merged_documents)
# Execute this cell only once

### II.3 Perform a similarity search

In [67]:
query = "How to train a Large Language Model?"
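
Conceptually, a similarity search embeds the query, compares that vector with the stored document vectors, and returns the closest ones. The sketch below illustrates the idea with cosine similarity on tiny made-up 3-dimensional vectors. The real embeddings in this notebook have 768 dimensions and the comparison runs inside the database, so treat this as an illustration only; all names and values are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], doc_vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in doc_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for document embeddings
doc_vectors = {
    "training": [0.9, 0.1, 0.0],
    "rnn": [0.1, 0.9, 0.1],
    "attention": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # made-up embedding of the query

results = top_k(query_vector, doc_vectors, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```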

In [75]:
retriever = vector_store.as_retriever(
    search_type=#TODO
    search_kwargs=#TODO
)

docs = retriever.get_relevant_documents(query)

In [76]:
for doc in docs:
    print("-" * 50)
    print("Content: ", doc.page_content)
    print("Metadata: ", doc.metadata)

--------------------------------------------------
Content:  I.A.7 Training Process
Training process
Steps 
Find scaling recipes (example: learning rate decrease if the size of the model increase)
Tune hyper parameters on small models of differents size
Choose the best models among the smallest ones
Train the biggest model with the 
Q. Should I use Transformers or LSTM ? 
‹#›
Stanford CS229 I Machine Learning I Building Large Language Models (LLMs) [Youtube]
II.A.3 RNN
Recurrent Neural Networks (Seq2seq model)
‹#›
II.B.1 Self Attention Mechanism
   are
Transformers
decoder
encoder
Query: What am I looking for ? 
|E| : Embedding (1, 12 288)
|WQ|: Query matrix (12 288, 128)
WQ
2.11
-4.22
..
..
5.93
2.43
-3.2
..
..
3.32
2.11
-4.22
..
..
1.12
3.11
-4.22
..
..
4.98
Embedding
3.23    -1.23    0.89    0.32
 -3.29     3.23    1.23   -2.34
  1.83     1.92    0.1     1.28
E2
E3
E1
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
2.11
-4.22
..
..
5.93
Query		
Q1
Q2
Q3
Q4
Am I a s

**Congratulations**! You have successfully ingested the data in Cloud SQL.