# razulibs/razu documentation

### Class ```S3Storage```

This class uses boto3, the official AWS SDK for Python, to interact with S3 storage. It wraps AWS API calls in class methods, adding the required parameters, exception handling, and any other features that are needed, and so provides methods for interacting with an S3-compatible storage service.

The subclass EDepot combines these 'basic' S3 services to implement ad hoc e-depot pipelines.

It includes functionalities to:

- Check or create a bucket
- Upload files with metadata
- List objects in a bucket
- Retrieve file metadata
- Verify the integrity of uploads using MD5 checksums
- Update and retrieve access control lists (ACLs) of objects
- Retrieve the bucket policy
- Check and retrieve block public access settings

The class initializes an S3 client using credentials (endpoint, access key, and secret key) retrieved from environment variables. The credentials should be stored in a `.env` file. Lookup order for `.env` is:

1) The current working directory (so different projects/buckets can use their own creds)
2) Fallback to the module directory (this file's directory)
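
The lookup order above can be sketched as follows. This is a minimal illustration, not the class's actual code: `find_env_file` and its parameters are hypothetical names, and raising an error when no file is found is an assumption about the desired behaviour.

```python
from pathlib import Path


def find_env_file(cwd: Path, module_dir: Path, name: str = ".env") -> Path:
    """Return the first existing .env file, checking cwd before module_dir."""
    for directory in (cwd, module_dir):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    # Assumption: a missing .env file should be a hard error, not a silent no-op.
    raise FileNotFoundError(f"No {name} found in {cwd} or {module_dir}")
```

A caller inside the module could use it as `find_env_file(Path.cwd(), Path(__file__).resolve().parent)`.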

NOTE: On my client "sudo hwclock -s" is sometimes required when the system clock is 'skewed', because S3 rejects requests whose timestamp differs too much from the server time.

N.B. When passing the object key as an argument, always pass the full key, e.g. "nl-wbdrazu/k50907905/689/001/067/nl-wbdrazu-k50907905-689-1067806.extension", not just the last part.


**Attributes**

Attributes are loaded from a `.env` file, which is looked up first in the current working directory and then, as a fallback, in the module directory.

```endpoint```: the s3 storage endpoint

```access_key```: s3 access key

```secret_key```: s3 secret key

```s3_client```: the boto3 s3 client instance. This is initialized using the endpoint, access key, and secret key. See [boto3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).

    *Issues*:

    - env vars stored as attributes [Identity and Access Management in AWS EC2](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html)
    - we want an error if the env file is not loaded correctly, missing etc.
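
One way to get the error asked for in the second issue is to validate the settings right after loading them. The sketch below is only an illustration: the variable names in `REQUIRED_VARS` and the function `require_settings` are hypothetical, and the real names used by the class may differ.

```python
from typing import Mapping

# Hypothetical variable names; the names actually read by S3Storage may differ.
REQUIRED_VARS = ("S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def require_settings(env: Mapping[str, str]) -> dict:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing S3 settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

It could be called as `require_settings(os.environ)` once the `.env` file has been loaded.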


**Methods**

```check_or_create_bucket```: Checks if a bucket exists in the S3 storage and creates it if it does not exist. Takes in input a bucket name and a boolean to enable versioning. Returns True if the bucket exists or was created successfully, False if an error occurred.

*This is two tasks, and when True is returned you don't know whether the bucket already existed or was created. A message is printed, so maybe this is not a problem. I guess the logic here is: I want to make a bucket, I don't need to know if it exists already, but the back end needs to make sure that an existing bucket is not created again. **I wonder what the use of this function is. If I want to create a bucket and it already exists, maybe we want to return something other than True, like 'choose another name'.***

*Also, why is enable_bucket_versioning False by default? If versioning is set to True, the function calls another method, 'set_bucket_versioning', whose status defaults to 'Enabled'. But **there is no confirmation that versioning was set**.*

*When creating a bucket, make sure to check the naming rules.*
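
A name check could look roughly like the sketch below. It covers only a subset of the AWS rules for general purpose buckets (length 3 to 63, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit, no adjacent dots, not shaped like an IP address). The function name `is_valid_bucket_name` is hypothetical, and other S3-compatible services may apply different rules.

```python
import re

# 3 to 63 characters, starting and ending with a lowercase letter or digit.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
# Names shaped like an IPv4 address (e.g. 192.168.5.4) are not allowed.
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name: str) -> bool:
    """Check a subset of the AWS naming rules for general purpose buckets."""
    if not _NAME_RE.fullmatch(name):
        return False
    if ".." in name:
        return False
    if _IP_RE.fullmatch(name):
        return False
    return True
```

Note that bucket names must be lowercase, so a name like NL-WbDRAZU would be rejected by this check.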

```set_bucket_versioning```: Sets the versioning status on an S3 bucket. Takes in input the name of the bucket and a string specifying the versioning status ('Enabled', the default, or 'Suspended'). Returns True if the versioning status was set successfully, False otherwise.

*Uses the [boto3 put_bucket_versioning method](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/put_bucket_versioning.html).*

```get_bucket_versioning```: Gets the versioning status of an S3 bucket. Takes in input the name of the bucket to check and returns the versioning status.
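
The two boto3 calls behind these methods can be combined to confirm that the requested status was actually applied. This is a sketch under assumptions, not the class's code: `set_and_confirm_versioning` is a hypothetical helper, and `client` is expected to be a boto3 S3 client (or anything with the same two methods).

```python
def set_and_confirm_versioning(client, bucket_name: str, status: str = "Enabled") -> bool:
    """Set the versioning status, then read it back to confirm it was applied."""
    if status not in ("Enabled", "Suspended"):
        raise ValueError("status must be 'Enabled' or 'Suspended'")
    client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": status},
    )
    # The response has no "Status" key if versioning was never configured.
    response = client.get_bucket_versioning(Bucket=bucket_name)
    return response.get("Status") == status
```

Reading the status back costs one extra request per call, which seems acceptable for an operation that is rarely performed.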

```store_file```: Uploads one file to the specified S3 bucket along with its metadata. Takes in input the name of the bucket to upload to, the name of the file (object_key), the local path of the file to upload, and a metadata dictionary. It doesn't return anything but prints out a confirmation text.

*Determines the file type (format) via mimetypes. I do not understand where the metadata in the parameters comes from; what is it? It uses the boto3 [upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html) method.*

```get_file_metadata```: Returns the metadata of a specific file (object) in an S3 bucket, or None if the file or bucket doesn't exist. Takes in input the bucket name and the file name (file_key). Uses the [boto3 head_object method](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/head_object.html). N.B. This only selects the Metadata key of the head_object response.

*The name file_key should maybe be object_key, since that is what it is called in the previous method. We should harmonize variable names. Why this logic of hiding the error? '# For missing objects or buckets, return None silently so callers can treat it as "does not exist".' Additionally, for the moment I always got errors and couldn't retrieve metadata.*
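
One alternative to hiding every error is to return None only for "not found" responses and re-raise everything else, so that permission or connection problems stay visible. The sketch below is an assumption-laden illustration: `get_metadata_or_none` is a hypothetical name, the set of error codes is my understanding of what boto3 reports for missing objects and buckets, and it catches a generic `Exception` only to stay dependency-free (real code would catch `botocore.exceptions.ClientError`).

```python
from typing import Optional

# Assumed error codes for a missing object or bucket.
NOT_FOUND_CODES = {"404", "NoSuchKey", "NoSuchBucket"}


def get_metadata_or_none(client, bucket_name: str, object_key: str) -> Optional[dict]:
    """Return the object's user metadata, or None if the object is not found."""
    try:
        response = client.head_object(Bucket=bucket_name, Key=object_key)
    except Exception as exc:
        # botocore's ClientError carries the parsed error in `exc.response`.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in NOT_FOUND_CODES:
            return None
        raise  # anything else (e.g. access denied) is a real error
    return response.get("Metadata", {})
```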

```verify_upload```: Takes in input the bucket name, the file_key (file name, object_key) and the MD5 of the file. Returns nothing but prints out a confirmation text of success or failure.

*Calculates the checksum of the S3 file internally; how expensive is that computationally? Look into computing the combined MD5 from the parts without downloading. **Does downloading cost money?***
*An ETag is an opaque identifier assigned by a web server to a specific version of a resource found at a URL. For single-part uploads it is usually the MD5 checksum. For bigger files uploaded in parts, a separate MD5 is computed for each part; that is why the function tries to detect that case and, if so, downloads the file and recalculates the MD5. [s3 data integrity](https://repost.aws/knowledge-center/data-integrity-s3) might be useful to read. The ETag is not necessarily the MD5.*
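
The idea of comparing against a multipart ETag without downloading can be sketched as below. This relies on a widely observed convention (the MD5 of the concatenated per-part MD5 digests, followed by a hyphen and the part count), which to my knowledge is not a documented guarantee; it also requires knowing the part size used at upload time and does not hold for every encryption mode. All three function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 of the whole payload, as expected for a single-part ETag."""
    return hashlib.md5(data).hexdigest()


def is_multipart_etag(etag: str) -> bool:
    """Multipart ETags look like '<32 hex chars>-<part count>'."""
    return "-" in etag.strip('"')


def multipart_etag(data: bytes, part_size: int) -> str:
    """MD5 of the concatenated per-part MD5 digests, plus '-<part count>'."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

In practice the local file would be read in chunks of `part_size` rather than held in memory; the in-memory version here only keeps the example short.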

```update_acl```: Updates the ACL (Access Control List) for a specific object in the S3 bucket. Returns nothing but prints out a success/error message. 

*By default the function uses a [Canned ACL](https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html#canned-acl) set to 'public-read' rather than an AccessControlList dictionary. It uses the [boto3 put_object_acl](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/put_object_acl.html) method. To learn what the different access levels entail, see [s3 ACL overview](https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html). The default 'public-read' means that ALL USERS get READ permission and the OWNER gets FULL_CONTROL (the combination of READ, WRITE, READ_ACP, and WRITE_ACP). READ maps to s3:ListBucket, s3:ListBucketVersions, and s3:ListBucketMultipartUploads on a bucket, and to s3:GetObject and s3:GetObjectVersion on an object.*

```get_object_acl```: Takes in input the bucket name and the file_key (object_key). Returns nothing but prints out the ACL of the object.

```get_bucket_contents```: Returns a list of all object keys in a bucket using pagination. Takes in input the bucket name and a 'prefix', which is the starting string of an object key (e.g. nl-wbdrazu/k50907905/689 (to check), or NL-WbDRAZU-K50907905-500).

Pagination means that results are read from the 'list_objects_v2' API in batches: each page holds at most 1000 objects. The method reads the pages and appends all results to a single list.
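
Paginated listing with boto3 can be sketched as follows. This is an illustration rather than the method's actual code: `list_keys` is a hypothetical name, and `client` is assumed to be a boto3 S3 client, whose `get_paginator("list_objects_v2")` handles the continuation tokens between pages.

```python
from typing import List


def list_keys(client, bucket_name: str, prefix: str = "") -> List[str]:
    """Collect every object key under `prefix`, one page at a time."""
    paginator = client.get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

Because the loop body runs once per page, it is also a natural place to report progress for large buckets.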

*This can take a long time; maybe we should add a progress bar.*

```get_bucket_policy```: Returns the policy of the S3 bucket specified in input, or an error message. Uses boto3 [get_bucket_policy](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/get_bucket_policy.html).

*The k50907905 bucket doesn't have one.*

```get_block_public_access```: Checks whether Block Public Access is enabled for a specific S3 bucket. [Block Public Access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html) is a feature that provides centralized control to limit public access. If enabled, it overrides ACL settings and other policies to always prevent public access. Returns the block public access configuration for the bucket specified in input.
Uses boto3 [get_public_access_block](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/get_public_access_block.html). For k50907905 it is not enabled.

*Why do we change the name of the function from the boto3 one? Isn't that confusing?*

```list_buckets```: Returns a list of all available buckets in the S3 storage. Uses boto3 [list_buckets](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_buckets.html)

```delete_bucket```: Returns a boolean value indicating whether the bucket was deleted successfully. By default the bucket must be empty to be deleted. Takes in input the bucket name and a boolean to force the deletion of all objects in the bucket before deleting the bucket itself.

Uses [head_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/head_bucket.html) to determine whether the bucket exists, then deletes it using [delete_bucket](https://docs.aws.amazon.com/cli/latest/reference/s3api/delete-bucket.html).

If force=True, uses [list_object_versions](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_object_versions.html) to list all versioned objects in a bucket and delete them, before deleting the non-versioned objects with [delete_object]()

*Does adding the version id ensure that the deletion does not create a new object version?*
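
On the question above: in a versioned bucket, deleting a key without a VersionId adds a delete marker as the newest version, while deleting with a VersionId permanently removes that specific version. A sketch of emptying a bucket this way is shown below; `delete_all_versions` is a hypothetical helper, `client` is assumed to be a boto3 S3 client, and the real `delete_bucket` implementation may work differently.

```python
def delete_all_versions(client, bucket_name: str) -> int:
    """Permanently delete every object version and delete marker in a bucket."""
    paginator = client.get_paginator("list_object_versions")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket_name):
        # Delete markers are listed separately and must be removed as well.
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        for entry in entries:
            client.delete_object(
                Bucket=bucket_name,
                Key=entry["Key"],
                VersionId=entry["VersionId"],
            )
            deleted += 1
    return deleted
```

The return value is the number of delete calls made, which gives the caller something to log or compare against an expected count.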

```delete_file```: Returns a boolean value indicating whether the file was deleted successfully. Takes in input the file_key and its bucket name. Uses [delete_object](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/delete_object.html).


```_encode_metadata```: Internal method that encodes metadata values to handle non-ASCII characters that are not allowed or not safe in URLs. Takes in input a dictionary with metadata and returns the corresponding URL-encoded dictionary. It is an internal method, presumably so that it is not used outside the class/module. It is called by the ```store_file``` method.
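
URL-encoding of metadata values can be sketched with the standard library as below. This is an illustration only: the function names are hypothetical, and the exact set of characters the real method leaves unescaped is an assumption (here, the default of `urllib.parse.quote`).

```python
from typing import Dict, Mapping
from urllib.parse import quote, unquote


def encode_metadata(metadata: Mapping[str, object]) -> Dict[str, str]:
    """Percent-encode each value so that only ASCII characters remain."""
    return {key: quote(str(value)) for key, value in metadata.items()}


def decode_metadata(metadata: Mapping[str, str]) -> Dict[str, str]:
    """Reverse encode_metadata when reading the metadata back."""
    return {key: unquote(value) for key, value in metadata.items()}
```

Values read back from the storage stay percent-encoded, so a matching decode step is needed wherever the metadata is displayed or compared.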

*Why is it an internal method like this, and not a util for instance?*

    *Issues* 

    - check exception handling of all functions
    - in the 'delete objects from manifest' function, check whether there were versioned objects that have not been deleted. Implement checking for other versions, maybe as an option?


# Other to document

## razu/

the name of the class is also the name of the file(.py)

**Tasks**: create MDTO RDF metadata of resources.
- `RDFResource`: RDFResource represents an RDF node (either a URIRef or a BlankNode) along with its associated graph.
    It provides methods to add properties, handle nested data, and combine graphs.
- `MetaResource(RDFResource)`: Generates identifiers via the Identifiers class and gives context to the resource by loading Config, all directory paths, etc.
- `StructuredMetaResource(MetaResource)`: StructuredMetaResource extends MetaResource with additional methods to handle structured data. Adds domain specific LDTO metadata
- `MetaGraph`: Sets MDTO Namespaces and prefixes
- `Identifiers`: Generate uid uri from data in the Config file
- `ConceptResolver`: Resolves URIs for terms from a vocabulary and creates Concept objects


- `EDepot`: class and methods to interact with S3 storage via the boto3 library
- `Manifest` and `ManifestEntry`: classes to manage manifests

- `Sip`: manages SIPs

[link to repo](https://github.com/Regionaal-Archief-Zuid-Utrecht/razulibs)