In [None]:
import io
import os
import gzip
import boto3
import sqlite3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config
from langchain.chat_models import ChatOpenAI

## **Data Engineering**
Data engineering is the process of loading and preparing data for analysis while ensuring its accessibility, quality, and structure.
This matters especially when integrating with databases hosted on cloud platforms such as AWS.

In this section, we focus on how boto3, the AWS Software Development Kit (SDK) for Python, facilitates integration with Amazon S3 to access, download, and process data. By leveraging AWS services, we can efficiently manage large datasets, even those stored in compressed formats like `.db.gz`!

In [None]:
# Create an anonymous S3 client (DISABLE SIGNATURES) --> ONLY for public datasets
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Choice of (Public) S3 bucket (amend as you choose to!)
bucket_name = "megascenes"  

In [None]:
s3_response = s3.list_objects_v2(Bucket=bucket_name, MaxKeys=20)  # Fetch the first 20 objects

# Print the available file keys --> files which we can choose to access
if "Contents" in s3_response:
    print("Files in the bucket:")
    for obj in s3_response["Contents"]:
        print("-", obj["Key"])
else:
    print("No files found or access denied.")

In [None]:
# Same listing as above, this time without a MaxKeys argument.
# Note: a single list_objects_v2 call still returns at most 1,000 keys.
db_bucket = [obj["Key"] for obj in s3.list_objects_v2(Bucket=bucket_name)["Contents"]]
print(len(db_bucket))
db_bucket[:20]
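
A single `list_objects_v2` call returns at most 1,000 keys, so a larger bucket needs pagination. The sketch below is one possible way to collect every key using boto3's built-in paginator; `list_all_keys` and its `prefix` parameter are hypothetical names introduced here for illustration, not part of the original notebook.

```python
def list_all_keys(client, bucket, prefix=""):
    """Collect every object key in `bucket`, following S3 pagination."""
    paginator = client.get_paginator("list_objects_v2")

    # Only send Prefix when one was actually requested
    kwargs = {"Bucket": bucket}
    if prefix:
        kwargs["Prefix"] = prefix

    keys = []
    for page in paginator.paginate(**kwargs):
        # "Contents" is missing from a page when nothing matches
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

With the anonymous client created earlier, this could be called as `list_all_keys(s3, bucket_name, prefix="databases/")`. Listing a very large bucket this way can take a while, so narrowing the prefix is usually worthwhile.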

In [None]:
file_key = 'databases/descriptors/000/000/descriptors.db.gz' # Amend as you wish!
# Read the file from S3 into memory
obj_response = s3.get_object(Bucket=bucket_name, Key=file_key)
compressed_data = obj_response["Body"].read()

# Decompress the data
with gzip.GzipFile(fileobj=io.BytesIO(compressed_data)) as f:
    decompressed_data = f.read()

# Keep the decompressed bytes in an in-memory buffer
db_buffer = io.BytesIO(decompressed_data)

# sqlite3.connect() takes a path, not a buffer, so this only opens a new, empty
# in-memory database. Nothing has been written to disk yet.
conn = sqlite3.connect(":memory:")

In [None]:
# Write the decompressed database to a .db file that SQLite can open
with open("temp.db", "wb") as temp_db:
    temp_db.write(db_buffer.getvalue())

# Close the empty in-memory connection, then reconnect to the temporary SQLite database
conn.close()
conn = sqlite3.connect("temp.db")
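
The decompress, write, and reconnect steps above can also be wrapped into one helper that cleans up after itself. This is a minimal sketch under stated assumptions: `load_gzipped_sqlite` is a hypothetical name, and it relies on `Connection.backup` (Python 3.7+) to copy the staged file into an in-memory database so the temporary file can be deleted straight away.

```python
import gzip
import os
import sqlite3
import tempfile


def load_gzipped_sqlite(compressed: bytes) -> sqlite3.Connection:
    """Return an in-memory SQLite connection holding the gzipped database."""
    raw = gzip.decompress(compressed)

    # sqlite3.connect() takes a path, so stage the bytes in a temporary file
    fd, path = tempfile.mkstemp(suffix=".db")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(raw)
        disk = sqlite3.connect(path)
        try:
            memory = sqlite3.connect(":memory:")
            disk.backup(memory)  # copy the whole database into memory
        finally:
            disk.close()
    finally:
        os.remove(path)  # the in-memory copy no longer needs the file
    return memory
```

With the bytes downloaded earlier, usage would look like `conn = load_gzipped_sqlite(compressed_data)`. The trade-off is memory: the whole database is held in RAM, which may not suit very large files.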

In [None]:
# List the tables available in the database
table_names = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", conn)
table_names
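
Before running `SELECT *` against an unfamiliar table, it can help to check how many rows each table holds. The helper below is an illustrative sketch (`table_row_counts` is a hypothetical name) that uses only the standard `sqlite3` module and an open connection.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict:
    """Map each table name in the database to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts
```

Calling `table_row_counts(conn)` on the open connection returns a plain dictionary, which makes it easy to decide whether a table should be read in full or with a `LIMIT` clause.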

In [None]:
# Read the table into pandas
df = pd.read_sql_query("SELECT * FROM descriptors", conn)
conn.close()

## **GenAI & Data Extraction**

The code above is a fairly standard way to pull data from a publicly available AWS bucket onto a local machine. Writing it out again for every dataset and every bucket, however, is time-consuming and not very productive.

It would be preferable to have someone, or something, handle this for us as data engineers, and that is where AI comes in. However, the task is not as simple as passing in the URL of interest and asking 'Chat - please read this for me, thank you' (as much as one may wish that were the case).

The data engineer must therefore draw on prior knowledge to instruct the model clearly so that it produces reliable output, and must be prepared to troubleshoot that output wherever necessary.
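
One way to make the instructions explicit is to spell out the access method, the file format, and the expected output as numbered requirements. The sketch below is only one possible direction; `build_extraction_prompt` is a hypothetical helper, and the wording of each requirement is an assumption to be tuned against the model's actual replies.

```python
def build_extraction_prompt(bucket: str, key: str) -> str:
    """Assemble a prompt stating the access method, file format and expected output."""
    requirements = [
        f"The file lives in the public S3 bucket '{bucket}' under the key '{key}'.",
        "Access the bucket anonymously with boto3 (unsigned requests, no credentials).",
        "The object is a gzip-compressed SQLite database; decompress it before opening it.",
        "List the tables, then load one table into a pandas DataFrame.",
        "Return a single runnable Python script with its imports and no extra commentary.",
    ]
    numbered = "\n".join(
        f"{i}. {line}" for i, line in enumerate(requirements, start=1)
    )
    return "Write Python code that meets every requirement below.\n" + numbered
```

The result can be passed to the model in place of a free-form prompt, and individual requirements can be added or reworded after inspecting how the generated code runs.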

In [None]:
with open("SDS_OpenAI_key.txt", "r") as f:
    api_key = f.read().strip()
    os.environ["OPENAI_API_KEY"] = api_key

# temperature controls how much the generated output varies between runs.
# We expect reliable code as output, so it is set close to zero to keep
# results *somewhat* consistent across runs.
llm = ChatOpenAI(model_name="gpt-4", temperature=0.001)

In [None]:
## Craft a specific prompt, then improve it iteratively by analysing the output produced and how the code actually runs,
#  so that the file can be extracted from AWS into a pandas DataFrame

prompt = f"""Chat, given the available s3 database
with bucket {bucket_name} and file {file_key} 
help me read the file key - give me the code in Python please.
"""

# Improve this prompt as need be!!

In [None]:
chat_response = llm.invoke(prompt)
print(chat_response.content)
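
Chat models often wrap the code they return in Markdown code fences, mixed with explanatory text. As a convenience, the fenced parts can be pulled out of the reply before pasting them into the next cell. This is a rough sketch, assuming the reply uses standard triple-backtick fences; `extract_code_blocks` and `FENCE` are hypothetical names.

```python
import re

FENCE = "`" * 3  # a Markdown code fence (three backticks)


def extract_code_blocks(reply: str) -> list:
    """Return the contents of every fenced code block found in `reply`."""
    pattern = re.compile(
        FENCE + r"[ \t]*(?:python|py)?[ \t]*\n(.*?)" + FENCE,
        re.DOTALL | re.IGNORECASE,
    )
    return [block.strip() for block in pattern.findall(reply)]
```

For example, `extract_code_blocks(chat_response.content)` returns a list of code strings, or an empty list if the reply contains no fences. The extracted code should still be read before it is run.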

In [None]:
### Insert the code from GPT here!!!

""" 
FOR AI CODE
"""

In [None]:
# Remove the temporary files from disk (skip any that were never created)
for path in ("temp.db", "descriptors.db", "descriptors.db.gz"):
    if os.path.exists(path):
        os.remove(path)

## **GenAI & Transformation**

As explored above, GenAI can be a very useful tool for a data engineer who needs code for parsing databases from cloud-based sources, especially when a good prompt is used. But that is not where GenAI adds the most value in the ETL pipeline, or in pipelining in general.

Rather, GenAI can be integrated into the transformation stage and the analytical stage, where it can suggest ways to transform our data that go beyond the standard steps of imputing missing values or changing datatypes.

In [None]:
# Forward slashes work on Windows, macOS and Linux alike
df_prev = pd.read_csv("data/yellow_tripdata_2019-01.csv")
df_recent = pd.read_csv("data/yellow_tripdata_2020-01.csv")

In [None]:
df_recent.sample(100) # sample of the `recent` dataframe 

In [None]:
nrows = 20
prompt = f"""Chat, given the dataframes {df_prev.sample(nrows)} and
{df_recent.sample(nrows)}, what are some things I can do with them
"""
# Improve the prompt above so that GenAI gives you ideas for transformations you can implement!!
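
Placing a DataFrame directly inside an f-string inserts its printed representation, which pandas may wrap or abbreviate for wide tables depending on display settings. One possible alternative is to send the column names, dtypes, and a few rows as CSV text. The helper below is a sketch; `frame_to_prompt_text` is a hypothetical name, and it uses `head` rather than `sample` so that the prompt stays the same between runs.

```python
def frame_to_prompt_text(df, nrows=5):
    """Render column names, dtypes and the first rows as plain text for a prompt."""
    # One line per column, e.g. "- some_column: float64"
    schema = "\n".join(f"- {name}: {dtype}" for name, dtype in df.dtypes.items())
    # to_csv() without a path returns the CSV content as a string
    rows = df.head(nrows).to_csv(index=False)
    return f"Columns and dtypes:\n{schema}\n\nFirst {nrows} rows (CSV):\n{rows}"
```

A prompt could then embed `frame_to_prompt_text(df_recent)` instead of the DataFrame itself, which also keeps the prompt shorter than a large random sample would.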

In [None]:
chat_response = llm.invoke(prompt)
print(chat_response.content)

In [None]:
## Improve this prompt so that it outputs code for wild/wacky transformations that can be applied to your dataframe
## Craft your prompt so that the code returned by GPT needs little tuning or modification to perform well!!

prompt = f"""Chat, given the dataframe {df_recent.sample(10)} with column names {df_recent.columns} tell me how to apply some transformations.
"""

In [None]:
chat_response = llm.invoke(prompt)
print(chat_response.content)

In [None]:
### Insert the code from GPT here!!!

""" 
FOR AI CODE
"""