#**This notebook explains how to upload a CSV file from Snowflake, and extracted text files, to an AWS S3 bucket**

##1. Setting Up Snowflake

#### **Pre-requisites:**

Before running the script, install the required Python packages, `snowflake-connector-python` and `boto3`.

***Snowflake Connector for Python (snowflake-connector-python):*** This package lets Python code communicate with Snowflake, execute queries, and handle the results.

***Boto3:*** This is the Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as S3.

In [None]:
# Install snowflake-connector-python and boto3 (in a terminal, drop the leading "!")

!pip install snowflake-connector-python boto3

Collecting boto3
  Downloading boto3-1.34.43-py3-none-any.whl (139 kB)
Collecting botocore<1.35.0,>=1.34.43 (from boto3)
  Downloading botocore-1.34.43-py3-none-any.whl (12.0 MB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3)
  Downloading s3transfer-0.10.0-py3-none-any.whl (82 kB)
Installing collected packages: jmespath, botocore, s3transfer, boto3
Successfully installed boto3-1.34.43 botocore-1.34.43 jmespath-1.0.1 s3transfer-0.10.0
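
After the install finishes, it can help to confirm that both packages are importable from the current kernel. The sketch below is one way to do that; the helper name `missing_modules` is hypothetical and not part of either package.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            found = find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names of the two packages installed above
required = ["snowflake.connector", "boto3"]
print("Missing modules:", missing_modules(required))
```

An empty list means both packages were found; any name that is printed still needs to be installed in this environment.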


###**Creating an S3 Stage in Snowflake**

An **S3 stage** in Snowflake is a reference or pointer to an external location on Amazon S3 where you can stage (store temporarily) data files that are used for loading data into Snowflake or unloading data from Snowflake. It simplifies the process of specifying S3 paths in your SQL commands by allowing you to use the stage name instead of the full S3 path.

To create an **S3 stage** that points to the S3 bucket, use the Snowflake web interface or execute a SQL command such as the one below:


In [None]:
CREATE STAGE amos_s3_stage
  URL='s3://assign2amos'
  CREDENTIALS=(AWS_KEY_ID='<my_access_key>' AWS_SECRET_KEY='<my_secret_key>')
  FILE_FORMAT = (TYPE = 'CSV');
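
To illustrate how a stage name can stand in for the full S3 path, here is a minimal sketch that builds an unload statement against the stage. The helper `build_stage_unload_sql` is hypothetical; the stage, path, and table names are the ones used in this notebook, and the inputs are assumed to be trusted because they are interpolated directly into the SQL text.

```python
def build_stage_unload_sql(stage_name, stage_path, table_name):
    """Build a COPY INTO statement that unloads a table to a named stage.

    The stage already stores the S3 URL and credentials, so neither
    appears in the statement itself.
    """
    return (
        f"COPY INTO @{stage_name}/{stage_path}\n"
        f"FROM {table_name}\n"
        "FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"')\n"
        "OVERWRITE = TRUE;"
    )


sql = build_stage_unload_sql(
    "amos_s3_stage",
    "CFA_SNF_Dataset/SNF_dataSet.csv",
    "TEXT_EXTRACTION.PUBLIC.STRUCTURED_DATA",
)
print(sql)
```

The resulting string could be passed to a Snowflake cursor's `execute` method in place of a statement that embeds the AWS keys.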

##2. Setting up AWS

The setup includes creating an AWS account, setting up an S3 bucket, and creating an IAM user with the necessary permissions.




### Step 1: Sign Up for AWS


1. **Create an AWS Account:** Go to the AWS homepage (https://aws.amazon.com/) and sign up.
2. **Log in to the AWS Management Console:** Once the account is set up, log in to the AWS Management Console.

### Step 2: Create an S3 Bucket


1. **Go to the S3 Service:** In the AWS Management Console, find **S3** under the Services menu.
2. **Create a New Bucket:** Click **Create bucket** and follow the on-screen instructions to configure the bucket.
3. **Review and Create:** Review the settings and click **Create bucket**.
4. **Bucket Name:** The bucket used in this notebook is named `assign2amos`.

### Step 3: Create an IAM User and Assign Permissions


1. **Go to the IAM Service:** In the AWS Management Console, find **IAM** under the Services menu.
2. **Create a New User:** In the navigation pane, click **Users** --> **Add user**. Provide a user name (**AMOS-teammates**) and select Programmatic access as the AWS access type.
3. **Attach Policies for S3 Access:** On the permissions page, select **Attach existing policies directly**. Search for and select the `AmazonS3FullAccess` policy, or select the customized policy created for this project, ***Policy_AMOS_uploadData***.
4. **Review and Create:** Review the user details and permissions, then click **Create user**. The AMOS-teammates user is now created.
5. **Download Credentials:** After the user is created, download the **Access Key ID** and **Secret Access Key**. Save these credentials securely for later use.
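
The contents of `Policy_AMOS_uploadData` are not shown in this notebook. As an illustration only, the sketch below builds a policy document that limits access to a single bucket instead of granting `AmazonS3FullAccess`. The chosen actions and the helper name `build_bucket_policy` are assumptions, not the project's actual policy.

```python
import json


def build_bucket_policy(bucket_name):
    """Build an IAM policy document scoped to one S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to the keys inside the bucket
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = build_bucket_policy("assign2amos")
print(json.dumps(policy, indent=2))
```

The printed JSON could be pasted into the IAM policy editor when creating a customized policy.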

### Step 4: Configure AWS CLI with Your Credentials


  1. **Install the AWS CLI:** Download and install the AWS Command Line Interface (CLI).
  2. **Configure the CLI:**
    * Open a terminal or command prompt and run `aws configure`.
    * Enter the access key ID and secret access key when prompted. Also, specify the default region name and output format (e.g., `us-west-2` and `json`).
   ```
   aws configure
   AWS Access Key ID [None]: YOUR_ACCESS_KEY_ID
   AWS Secret Access Key [None]: YOUR_SECRET_ACCESS_KEY
   Default region name [None]: YOUR_PREFERRED_REGION
   Default output format [None]: json
   ```
  3. **Verify Configuration:** Run `aws s3 ls` to verify the configuration; it lists the S3 buckets in your AWS account.
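
The same check can be done from Python. The sketch below assumes an S3 client created with `boto3.client("s3")`, which picks up the credentials saved by `aws configure`; the helper names are hypothetical.

```python
def list_bucket_names(s3_client):
    """Return the names of all buckets visible to the given S3 client."""
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response.get("Buckets", [])]


def bucket_is_visible(s3_client, bucket_name):
    """Return True if bucket_name is among the buckets the client can list."""
    return bucket_name in list_bucket_names(s3_client)
```

For example, `bucket_is_visible(boto3.client("s3"), "assign2amos")` should return `True` once the credentials and the bucket are set up as described above.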


### Step 5: Run Your Python Script


After completion of the above steps, the AWS environment is ready.
Now run the Python script below to upload files to your S3 bucket.

####Uploading CSV file from Snowflake database to AWS S3 bucket

In [None]:
# Unload a table from Snowflake (SNF) to the AWS S3 bucket as a CSV file

import snowflake.connector

# Snowflake connection parameters
sf_account = 'CKMFZWO-AFA59273'
sf_user = 'AKSHITAPATHANIA'
sf_password = '<my_snf_password>'
sf_warehouse = 'TEXT_EXTRACTION_WH'
sf_database = 'TEXT_EXTRACTION'
sf_schema = 'PUBLIC'
sf_role = 'ACCOUNTADMIN'

# AWS S3 parameters
aws_access_key_id = '<my_access_key>'
aws_secret_access_key = '<my_secret_key>'
bucket_name = 'assign2amos'  #S3 bucket name
s3_stage = 'amos_s3_stage'  # defined in Snowflake
s3_key = 'CFA_SNF_Dataset/SNF_dataSet.csv'  # desired S3 key

# SQL query to select data from Snowflake
sql_query = """
SELECT *
FROM TEXT_EXTRACTION.PUBLIC.STRUCTURED_DATA;
"""

# Connecting to Snowflake
ctx = snowflake.connector.connect(
    user=sf_user,
    password=sf_password,
    account=sf_account,
    warehouse=sf_warehouse,
    database=sf_database,
    schema=sf_schema,
    role=sf_role,
)

# Creating a cursor object
cur = ctx.cursor()

try:
    # Use Snowflake's COPY INTO command to unload data to S3
    unload_query = f"""
    COPY INTO 's3://{bucket_name}/{s3_key}'
    FROM TEXT_EXTRACTION.PUBLIC.STRUCTURED_DATA
    CREDENTIALS = (AWS_KEY_ID='{aws_access_key_id}' AWS_SECRET_KEY='{aws_secret_access_key}')
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"')
    OVERWRITE = TRUE;
    """

    # Executing the unload query
    cur.execute(unload_query)

    print(f"Data successfully unloaded to S3 bucket '{bucket_name}' with key '{s3_key}'.")

finally:
    # Closing the cursor and connection
    cur.close()
    ctx.close()

Data successfully unloaded to S3 bucket 'assign2amos' with key 'CFA_SNF_Dataset/SNF_dataSet.csv'.
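
Depending on the COPY options, Snowflake may write the unloaded data as one or more files whose names start with the given path, so it can be worth listing what actually landed in the bucket. The sketch below assumes a boto3 S3 client; the helper name `list_keys_with_prefix` is hypothetical.

```python
def list_keys_with_prefix(s3_client, bucket_name, prefix):
    """Return the object keys in bucket_name that start with prefix.

    A single list_objects_v2 call returns at most 1000 keys, which is
    assumed to be enough for the handful of files handled here.
    """
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Calling `list_keys_with_prefix(boto3.client("s3"), "assign2amos", "CFA_SNF_Dataset/")` after the unload shows the exact keys that were created.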


####Uploading extracted text files from Grobid to AWS S3 Bucket

In [None]:
import boto3

def upload_files_to_s3(file_paths, aws_access_key_id, aws_secret_access_key, bucket_name, s3_keys):
    """
    Uploads multiple files to an AWS S3 bucket.

    Args:
    - file_paths (list): Paths to the files to be uploaded.
    - aws_access_key_id (str): AWS access key ID.
    - aws_secret_access_key (str): AWS secret access key.
    - bucket_name (str): Name of the AWS S3 bucket.
    - s3_keys (list): The S3 keys (paths) for the uploaded files.
    """
    # Initialize S3 client
    s3_client = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)

    for file_path, s3_key in zip(file_paths, s3_keys):
        # Upload file to S3
        s3_client.upload_file(file_path, bucket_name, s3_key)
        print(f"File '{file_path}' uploaded to S3 bucket '{bucket_name}' with key '{s3_key}'.")

# Assuming these files are text files and located in the current directory
file_paths = [
    'Grobid_RR_2024_l1_combined.txt',
    'Grobid_RR_2024_l2_combined.txt',
    'Grobid_RR_2024_l3_combined.txt'
]

s3_keys = [
    'Grobid_Dataset/Grobid_RR_2024_l1_combined.txt',  # S3 keys for the Grobid data files
    'Grobid_Dataset/Grobid_RR_2024_l2_combined.txt',
    'Grobid_Dataset/Grobid_RR_2024_l3_combined.txt'
]

aws_access_key_id = '<my_access_key>'  # AWS access key ID
aws_secret_access_key = '<my_secret_key>'  # AWS secret access key
bucket_name = 'assign2amos'  # S3 bucket name


upload_files_to_s3(file_paths, aws_access_key_id, aws_secret_access_key, bucket_name, s3_keys)
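
Since each S3 key above is just a folder prefix plus the local file name, the key list can also be derived from the file list instead of being typed out twice. This is a minimal sketch; the helper name `build_s3_keys` is hypothetical.

```python
import posixpath
from pathlib import Path


def build_s3_keys(file_paths, prefix):
    """Return one S3 key per local path: the prefix joined with the file name."""
    # S3 keys always use forward slashes, so join with posixpath
    return [posixpath.join(prefix, Path(path).name) for path in file_paths]


grobid_files = [
    "Grobid_RR_2024_l1_combined.txt",
    "Grobid_RR_2024_l2_combined.txt",
    "Grobid_RR_2024_l3_combined.txt",
]
grobid_keys = build_s3_keys(grobid_files, "Grobid_Dataset")
print(grobid_keys)
```

Deriving the keys this way keeps the two lists the same length and in the same order, which `zip` in `upload_files_to_s3` relies on.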


####Uploading extracted text files from PyPDF to AWS S3 Bucket

In [None]:
# Code to upload the PyPDF text files to the AWS S3 bucket

import boto3

def upload_files_to_s3(file_paths, aws_access_key_id, aws_secret_access_key, bucket_name, s3_keys):
    """
    Uploads multiple files to an AWS S3 bucket.

    Args:
    - file_paths (list): Paths to the files to be uploaded.
    - aws_access_key_id (str): AWS access key ID.
    - aws_secret_access_key (str): AWS secret access key.
    - bucket_name (str): Name of the AWS S3 bucket.
    - s3_keys (list): The S3 keys (paths) for the uploaded files.
    """
    # Initializing S3 client
    s3_client = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)

    for file_path, s3_key in zip(file_paths, s3_keys):
        # Uploading file to S3
        s3_client.upload_file(file_path, bucket_name, s3_key)
        print(f"File '{file_path}' uploaded to S3 bucket '{bucket_name}' with key '{s3_key}'.")

# Update these paths if your files are located in a different directory
file_paths = [
    'Pypdf_RR_2024_l1_combined.txt.txt',  # Adjusted for your specific files
    'Pypdf_RR_2024_l2_combined.txt',
    'Pypdf_RR_2024_l3_combined.txt.txt'
]

s3_keys = [
    'pyPDF_Dataset/Pypdf_RR_2024_l1_combined.txt',  # Adjusted S3 keys for the new files
    'pyPDF_Dataset/Pypdf_RR_2024_l2_combined.txt',
    'pyPDF_Dataset/Pypdf_RR_2024_l3_combined.txt'
]

aws_access_key_id = '<my_access_key>'  # AWS access key ID
aws_secret_access_key = '<my_secret_key>'  # AWS secret access key
bucket_name = 'assign2amos'  # S3 bucket name

upload_files_to_s3(file_paths, aws_access_key_id, aws_secret_access_key, bucket_name, s3_keys)
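
Because the local file names here do not all follow the same pattern, a quick existence check before uploading can catch a mistyped path early. The sketch below is one way to do it; the helper name `find_missing_files` is hypothetical.

```python
from pathlib import Path


def find_missing_files(file_paths):
    """Return the paths from file_paths that do not point to an existing file."""
    return [path for path in file_paths if not Path(path).is_file()]


pypdf_files = [
    "Pypdf_RR_2024_l1_combined.txt.txt",
    "Pypdf_RR_2024_l2_combined.txt",
    "Pypdf_RR_2024_l3_combined.txt.txt",
]
missing = find_missing_files(pypdf_files)
if missing:
    print("Not found, fix these paths before uploading:", missing)
else:
    print("All files found.")
```

Running this before `upload_files_to_s3` avoids a partial upload that stops at the first missing file.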

The following cells use SQLAlchemy to upload the structured metadata from Grobid, including the link to each uploaded text file in S3, into a Snowflake database.

In [None]:
!pip install snowflake-sqlalchemy



In [None]:
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

engine = create_engine(URL(
    user='akshitapathania',
    password='<my_snf_password>',
    account='CKMFZWO-AFA59273',
    warehouse='GROBID_METADATA_WH',
    database='GROBID_METADATA',
    schema='PUBLIC'
))


In [None]:
from sqlalchemy import Column, Integer, String, Sequence
# declarative_base is importable from sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class GrobidDocumentMetadata(Base):
    __tablename__ = 'GROBID_DOCUMENT_METADATA'
    id = Column(Integer, Sequence('doc_id_seq'), primary_key=True)
    document_name = Column(String)
    s3_link = Column(String)
    additional_metadata = Column(String)


In [None]:
Base.metadata.create_all(engine)

In [None]:
from sqlalchemy.orm import sessionmaker

# Function to generate S3 link
def generate_s3_link(bucket_name, s3_key):
    return f"s3://{bucket_name}/{s3_key}"


# Creating a session
Session = sessionmaker(bind=engine)
session = Session()


bucket_name = 'assign2amos'

# Metadata records to insert, based on your file upload code
metadata_records = [
    GrobidDocumentMetadata(
        document_name='Grobid_RR_2024_l1_combined.txt',
        s3_link=generate_s3_link(bucket_name, 'Grobid_Dataset/Grobid_RR_2024_l1_combined.txt'), #file 1 from grobid
        additional_metadata='{}'
    ),
    GrobidDocumentMetadata(
        document_name='Grobid_RR_2024_l2_combined.txt',
        s3_link=generate_s3_link(bucket_name, 'Grobid_Dataset/Grobid_RR_2024_l2_combined.txt'), #file 2 from grobid
        additional_metadata='{}'
    ),
    GrobidDocumentMetadata(
        document_name='Grobid_RR_2024_l3_combined.txt',
        s3_link=generate_s3_link(bucket_name, 'Grobid_Dataset/Grobid_RR_2024_l3_combined.txt'), #file 3 from grobid
        additional_metadata='{}'
    )
]

# Inserting the records into the metadata table
session.add_all(metadata_records)
session.commit()

# Closing the session
session.close()

This script creates an entry in our Snowflake database for each Grobid document uploaded to S3, including its name, its S3 link, and a placeholder for any additional metadata from Grobid.
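
When a stored `s3_link` is read back later, it usually has to be split into a bucket name and a key before the file can be fetched. A minimal sketch of the reverse of `generate_s3_link`, using only the standard library, is shown below; the helper name `parse_s3_link` is hypothetical.

```python
from urllib.parse import urlparse


def parse_s3_link(s3_link):
    """Split an 's3://bucket/key' link into a (bucket_name, s3_key) tuple."""
    parsed = urlparse(s3_link)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 link: {s3_link!r}")
    # The path starts with the slash that separates bucket and key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = parse_s3_link(
    "s3://assign2amos/Grobid_Dataset/Grobid_RR_2024_l1_combined.txt"
)
print(bucket, key)
```

The returned pair matches the `bucket_name` and `s3_key` arguments that boto3 calls such as `upload_file` and `download_file` expect.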