# **Reading and Writing Metadata**


In [None]:
def get_file_variables(self):
    variables = list(self.variables.keys())
    print(variables)

## **NetCDF**


### **Libraries Needed**


In [1]:
import xarray as xr

xarray must be installed first, using `pip install xarray` or `conda install xarray`.


### **Loading the Data**


In [2]:
# Specify the file name
file = "example.nc"
# Open the NetCDF dataset
nc_dataset = xr.open_dataset(file)


### **Reading Metadata**


#### **Listing the Variables in the File**


In [3]:
# Get a list of variable names in the dataset
variables = list(nc_dataset.variables.keys())
# Print the list of variable names
print(variables)

['lat', 'lon', 'los', 'pair', 'ref', 'rep']


#### **Variables**


Most commonly used attributes in a **variable's metadata**:

- **units**: Specifies the units of the variable.
- **long_name**: Provides a full name of the variable.
- **standard_name/short_name**: Follows standard naming conventions for interoperability.
- **valid_min** and **valid_max**: Define the valid value range.
- **missing_value** or **fill_value**: Indicates missing or undefined data.
- **scale_factor** and **add_offset**: Scaling parameters for physical units.
- **coordinates**: Specifies associated coordinates.
- **axis**: Identifies the variable's varying axis (e.g., 'X', 'Y', 'Z', 'time').
- **description**: Provides a detailed or additional explanation about the variable.

_Note: You don't have to use all of these attributes, but these are the ones you will see most often._
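
As a rough illustration of how **scale_factor**, **add_offset**, and the fill value fit together, the sketch below decodes packed values by hand using the usual NetCDF relation `physical = stored * scale_factor + add_offset`. The function name `decode_packed` and the sample numbers are made up for illustration; xarray normally applies this decoding itself when it opens a file (see the `mask_and_scale` option of `open_dataset`).

```python
import math


def decode_packed(raw_values, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Convert stored (packed) values to physical values.

    Entries equal to fill_value are treated as missing and returned as NaN.
    """
    decoded = []
    for raw in raw_values:
        if fill_value is not None and raw == fill_value:
            decoded.append(math.nan)  # Missing data stays missing
        else:
            decoded.append(raw * scale_factor + add_offset)
    return decoded


# Made-up packed values; -9999 marks a missing entry
packed = [0, 100, -9999, 250]
physical = decode_packed(packed, scale_factor=0.01, add_offset=273.15, fill_value=-9999)
print(physical)
```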


**Getting The Variables Metadata**


In [24]:
# Loop over the coordinate variables in the dataset
for coord_name, coord_var in nc_dataset.coords.items():
    coord_name = coord_name.title()  # Improve readability
    print("Variable:", coord_name)
    if not coord_var.attrs:
        print("   No metadata was found for this variable.")
    else:
        print("   Values: ", coord_var.values)  # Print the values of the variable
        print(
            "   Attributes: ", coord_var.attrs
        )  # Print the attributes of the variable
    print()  # Give some space to improve readability

Variable: Lat
   Values:  [-7.52       -7.51916667 -7.51833333 ... -5.56833333 -5.5675
 -5.56666667]
   Attributes:  {'Units': 'Degrees', 'Long_Name': 'Latitude', 'Short_Name': 'lat', 'Valid_min': '-90', 'Valid_Max': '90', 'missing_Value': '-9999', 'Fill_Value': '-9999', 'Scale_Factor': '1.0', 'add_offset': '0.0', 'Coordinates': 'longitude', 'Axis': 'Y', 'Description': 'The angular distance between the north and south from the equator'}

Variable: Lon
   No metadata was found for this variable.

Variable: Pair
   No metadata was found for this variable.

Variable: Ref
   No metadata was found for this variable.

Variable: Rep
   No metadata was found for this variable.



**Another way to visualize the metadata:**


In [5]:
# Loop over the coordinate variables in the dataset
for coord_name, coord_var in nc_dataset.coords.items():
    coord_name = coord_name.title()  # Improve readability
    print("Variable:", coord_name)
    if not coord_var.attrs:
        print("   No metadata was found for this variable.")
    else:
        print(
            "   ", coord_var
        )  # Print the coordinate variable object, which includes values and attributes
    print()

Variable: Lat
   No metadata was found for this variable.

Variable: Lon
   No metadata was found for this variable.

Variable: Pair
   No metadata was found for this variable.

Variable: Ref
   No metadata was found for this variable.

Variable: Rep
   No metadata was found for this variable.



I prefer the first approach, because printing the whole coordinate object (the second approach) sometimes truncates the output, so not all of the metadata is visible. Both approaches work for inspecting a variable's metadata.
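
One way to avoid the truncation, sketched below, is to format the attribute dictionary yourself, one attribute per line. `format_attrs` is a hypothetical helper, not part of xarray; it only relies on `attrs` behaving like a regular dictionary, which is how the loops above already use it. The sample attributes are placeholders.

```python
def format_attrs(attrs, indent="   "):
    """Return the attributes as text, one 'name : value' pair per line."""
    if not attrs:
        return f"{indent}No metadata was found for this variable."
    # Pad the names to the same width so the values line up
    width = max(len(str(name)) for name in attrs)
    return "\n".join(
        f"{indent}{str(name):<{width}} : {value}" for name, value in attrs.items()
    )


# Placeholder attributes standing in for coord_var.attrs
sample_attrs = {
    "Units": "Degrees",
    "Description": "The angular distance between the north and south from the equator",
}
print(format_attrs(sample_attrs))
```

Inside the loop, the call would be `print(format_attrs(coord_var.attrs))`.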


##### **Printing the Values of a Specific Variable**


Printing a single variable, such as latitude, displays its values along with the associated attributes.


In [1]:
import suntzu as snt
teste = snt.read_file("./Employee_Cleaned.parquet")
teste.get_best_dtypes(show_df=True)


   Column_Name             Dtype     Best_Dtype
0  EmployeeIdentification  int16     int16
1  Age                     int8      int8
2  Attrition               category  category
3  BusinessTravel          category  category
4  DailyRate               int16     int16
5  Department              category  category
6  DistanceFromHome        int8      int8
7  Education               category  category
8  EducationField          category  category
9  EmployeeCount           int16     int16


In [6]:
# Replace "lat" with the name of the variable you want to inspect
print(nc_dataset["lat"])

#### **Global Attributes**


Global attributes are metadata that describe the entire dataset rather than a single variable.

Most commonly used **global attributes**:

- **title**: A title or brief description of the dataset.
- **institution**: The institution responsible for the dataset.
- **source**: Describes the data source or generation method.
- **history**: Records modifications or processing steps applied to the dataset.
- **references**: Provides references to relevant publications or resources.
- **Conventions**: Specifies the formatting conventions of the NetCDF file.
- **creator/author**: Identifies the creator of the dataset.
- **project**: Describes the associated project or research program.
- **license**: Specifies the usage terms for the dataset.
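
To see which of these attributes a dataset is still missing, a small check along the following lines may help. This is only a sketch: `missing_global_attrs` and `RECOMMENDED_GLOBAL_ATTRS` are hypothetical names, the recommended list is simply taken from the bullets above, and the comparison ignores case because attribute names are not always written the same way.

```python
# Attribute names taken from the list above (not an official requirement list)
RECOMMENDED_GLOBAL_ATTRS = [
    "title",
    "institution",
    "source",
    "history",
    "references",
    "Conventions",
]


def missing_global_attrs(attrs, recommended=RECOMMENDED_GLOBAL_ATTRS):
    """Return the recommended attribute names that are absent from attrs.

    The comparison ignores case, so 'Title' and 'title' count as the same name.
    """
    present = {str(name).lower() for name in attrs}
    return [name for name in recommended if name.lower() not in present]


# Placeholder dictionary standing in for nc_dataset.attrs
example_attrs = {"Title": "Example Dataset", "Conventions": "CF-1.8"}
print(missing_global_attrs(example_attrs))
# prints ['institution', 'source', 'history', 'references']
```

With the dataset loaded earlier, the call would be `missing_global_attrs(nc_dataset.attrs)`.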


**Getting the Global Attributes**


In [7]:
# Get the global attributes of the dataset
global_attrs = nc_dataset.attrs
# Check if the global attributes are present
if not global_attrs:
    print("No Global Attributes were found.")
else:
    # Otherwise, loop over each global attribute and print its name and value
    for attr_name, attr_value in global_attrs.items():
        attr_name = attr_name.title()  # Improve readability
        print(attr_name, ":", attr_value)

No Global Attributes were found.


### **Writing Metadata**


#### **Variables**


Attributes that will be used:

- **units**: Specifies the units of the variable.
- **long_name**: Provides a full name of the variable.
- **standard_name/short_name**: Follows standard naming conventions for interoperability.
- **valid_min** and **valid_max**: Define the valid value range.
- **missing_value** or **fill_value**: Indicates missing or undefined data.
- **scale_factor** and **add_offset**: Scaling parameters for physical units.
- **coordinates**: Specifies associated coordinates.
- **axis**: Identifies the variable's varying axis (e.g., 'X', 'Y', 'Z', 'time').
- **description**: Provides a detailed or additional explanation about the variable.


In [8]:
# Dummy Data for the attributes in the latitude variable
nc_dataset["lat"].attrs["Units"] = "Degrees"
nc_dataset["lat"].attrs["Long_Name"] = "Latitude"
nc_dataset["lat"].attrs["Short_Name"] = "lat"
nc_dataset["lat"].attrs["Valid_min"] = "-90"
nc_dataset["lat"].attrs["Valid_Max"] = "90"
nc_dataset["lat"].attrs["missing_Value"] = "-9999"
nc_dataset["lat"].attrs["Fill_Value"] = "-9999"
nc_dataset["lat"].attrs["Scale_Factor"] = "1.0"
nc_dataset["lat"].attrs["add_offset"] = "0.0"
nc_dataset["lat"].attrs["Coordinates"] = "longitude"
nc_dataset["lat"].attrs["Axis"] = "Y"
nc_dataset["lat"].attrs[
    "Description"
] = "The angular distance between the north and south from the equator"
# Print the updated latitude metadata
print(nc_dataset["lat"])

<xarray.DataArray 'lat' (lat: 2345)>
array([-7.52    , -7.519167, -7.518333, ..., -5.568333, -5.5675  , -5.566667])
Coordinates:
  * lat      (lat) float64 -7.52 -7.519 -7.518 -7.518 ... -5.568 -5.567 -5.567
    pair     object ...
    ref      object ...
    rep      object ...
Attributes:
    Units:          Degrees
    Long_Name:      Latitude
    Short_Name:     lat
    Valid_min:      -90
    Valid_Max:      90
    missing_Value:  -9999
    Fill_Value:     -9999
    Scale_Factor:   1.0
    add_offset:     0.0
    Coordinates:    longitude
    Axis:           Y
    Description:    The angular distance between the north and south from the...
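
The cell above uses capitalized attribute names and string values, which matches the output shown in this notebook. Tools that follow the CF conventions generally look for lowercase names such as `units` and `standard_name`, and for numeric range limits. A minimal sketch of that spelling follows, using a small in-memory dataset as a stand-in for `example.nc` (the coordinate values are made up) and `attrs.update` to set several attributes at once:

```python
import xarray as xr

# Small in-memory stand-in for the dataset opened earlier (made-up values)
ds = xr.Dataset(coords={"lat": [-7.52, -7.51, -7.50]})

# Set several attributes in one call; names use the lowercase CF spelling
ds["lat"].attrs.update(
    {
        "units": "degrees_north",
        "long_name": "Latitude",
        "standard_name": "latitude",
        "valid_min": -90.0,
        "valid_max": 90.0,
        "axis": "Y",
    }
)

# Print the updated latitude metadata
print(ds["lat"].attrs)
```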


#### **Global Attributes**


Attributes that will be used:

- **title**: A title or brief description of the dataset.
- **institution**: The institution responsible for the dataset.
- **source**: Describes the data source or generation method.
- **history**: Records modifications or processing steps applied to the dataset.
- **references**: Provides references to relevant publications or resources.
- **Conventions**: Specifies the formatting conventions of the NetCDF file.
- **creator/author**: Identifies the creator of the dataset.
- **project**: Describes the associated project or research program.
- **license**: Specifies the usage terms for the dataset.


In [9]:
# Dummy data for global attributes
nc_dataset.attrs["Title"] = "Example Dataset"
nc_dataset.attrs["Institution"] = "Example Institution"
nc_dataset.attrs["Source"] = "Trust Me"
nc_dataset.attrs["History"] = "Created on 33th of February of 2011"
nc_dataset.attrs["References"] = "Example References"
nc_dataset.attrs["Conventions"] = "CF-1.8"
nc_dataset.attrs["Creator_Author"] = "Asato Asato"
nc_dataset.attrs["Project"] = "Example Project"
nc_dataset.attrs[
    "Description"
] = "Example of the Example in a Example Dataset for Example purposes"
# Print the updated dataset
print(nc_dataset)

<xarray.Dataset>
Dimensions:  (lat: 2345, lon: 3062)
Coordinates:
  * lat      (lat) float64 -7.52 -7.519 -7.518 -7.518 ... -5.568 -5.567 -5.567
  * lon      (lon) float64 -79.5 -79.49 -79.49 -79.49 ... -76.95 -76.95 -76.94
    pair     object ...
    ref      object ...
    rep      object ...
Data variables:
    los      (lat, lon) float32 ...
Attributes:
    Title:           Example Dataset
    Institution:     Example Institution
    Source:          Trust Me
    History:         Created on 33th of February of 2011
    References:      Example References
    Conventions:     CF-1.8
    Creator_Author:  Asato Asato
    Project:         Example Project
    Description:     Example of the Example in a Example Dataset for Example ...


### **Checking The Updated Version Of The Metadata**


#### **Variables Metadata**


In [10]:
# Loop over the coordinate variables in the dataset
for coord_name, coord_var in nc_dataset.coords.items():
    coord_name = coord_name.title()  # Improve readability
    print("Variable:", coord_name)
    if not coord_var.attrs:
        print("   No metadata was found for this variable.")
    else:
        print("   Values: ", coord_var.values)  # Print the values of the variable
        print(
            "   Attributes: ", coord_var.attrs
        )  # Print the attributes of the variable
    print()  # Give some space to improve readability

Variable: Lat
   Values:  [-7.52       -7.51916667 -7.51833333 ... -5.56833333 -5.5675
 -5.56666667]
   Attributes:  {'Units': 'Degrees', 'Long_Name': 'Latitude', 'Short_Name': 'lat', 'Valid_min': '-90', 'Valid_Max': '90', 'missing_Value': '-9999', 'Fill_Value': '-9999', 'Scale_Factor': '1.0', 'add_offset': '0.0', 'Coordinates': 'longitude', 'Axis': 'Y', 'Description': 'The angular distance between the north and south from the equator'}

Variable: Lon
   No metadata was found for this variable.

Variable: Pair
   No metadata was found for this variable.

Variable: Ref
   No metadata was found for this variable.

Variable: Rep
   No metadata was found for this variable.



#### **Global Attributes**


In [11]:
# Get the global attributes of the dataset
global_attrs = nc_dataset.attrs
# Check if the global attributes are present
if not global_attrs:
    print("No Global Attributes were found.")
else:
    # Otherwise, loop over each global attribute and print its name and value
    for attr_name, attr_value in global_attrs.items():
        attr_name = attr_name.title()  # Improve readability
        print(attr_name, ":", attr_value)
        print()  # Give some space to improve readability

Title : Example Dataset

Institution : Example Institution

Source : Trust Me

History : Created on 33th of February of 2011

References : Example References

Conventions : CF-1.8

Creator_Author : Asato Asato

Project : Example Project

Description : Example of the Example in a Example Dataset for Example purposes



### **Saving The File**


In [12]:
new_file = "updated_example.nc"
nc_dataset.to_netcdf(new_file, format="NETCDF4")

## **Parquet**


### **Libraries Needed**


In [13]:
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd

These libraries must be installed first, using `pip install pyarrow pandas` or `conda install pyarrow pandas`.


### **Loading the Data**


We use `df.info()` to check the column names, which will be useful later.


In [26]:
# Reading the Parquet file into a DataFrame
df = pd.read_parquet("Titanic.parquet")
# Displaying information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   Gender               891 non-null    category
 1   Age                  714 non-null    float64 
 2   Siblings_on_Board    891 non-null    int8    
 3   Parents_on_Board     891 non-null    int8    
 4   Ticket_Price         891 non-null    float64 
 5   Port_of_Embarkation  889 non-null    category
 6   Class                891 non-null    category
 7   Adult/Child          891 non-null    category
 8   Alone                891 non-null    bool    
 9   Survived             891 non-null    int64   
dtypes: bool(1), category(4), float64(2), int64(1), int8(2)
memory usage: 27.6 KB


### **Converting a Pandas DataFrame To a PyArrow Table**


In [15]:
table = pa.Table.from_pandas(df)  # Converting the pandas DataFrame to a PyArrow Table

#### **Writing The Metadata**


In [16]:
# Define the schema for the PyArrow Table
table_schema = pa.schema(
    [
        pa.field(
            "Gender", "string", metadata={"Description": "The passenger's Gender"}
        ),
        pa.field(
            "Age",
            "string",
            metadata={"Description": "The passenger's Age", "Calculation": "No"},
        ),
        pa.field(
            "Siblings_on_Board",
            "int8",
            metadata={
                "Description": "Number of sibilings that the passenger had on board",
                "Calculation": "No",
            },
        ),
        pa.field(
            "Parents_on_Board",
            "int8",
            metadata={
                "Description": "Number of parents that the passenger had on board",
                "Calculation": "No",
            },
        ),
        pa.field(
            "Ticket_Price",
            "float64",
            metadata={"Description": "Ticket's Price", "Calculation": "No"},
        ),
        pa.field(
            "Port_of_Embarkation",
            "string",
            metadata={"Description": "The port were the passenger embarked"},
        ),
        pa.field(
            "Class",
            "string",
            metadata={"Description": "The passenger's class on the ship"},
        ),
        pa.field(
            "Adult/Child",
            "string",
            metadata={"Description": "If the passenger is child or not"},
        ),
        pa.field(
            "Alone",
            "bool",
            metadata={"Description": "If the passenger is travelling alone or not"},
        ),
        pa.field(
            "Survived",
            "int64",
            metadata={
                "Description": "If the passenger survived or not",
                "Calculation": "No",
            },
        ),
    ]
)

The schema is defined with a list of pa.field() objects, where each field represents a column in the table. Each field specifies the column name, data type, and metadata.

The metadata is defined as a dictionary within the metadata parameter of each field. It provides additional information about the column, such as descriptions or calculations.


**_Some Notes:_**

- _The first two arguments of each field (column name and data type) must be spelled correctly; a misspelled column name or dtype raises an error._
- _The `Calculation` entry records whether the values in the column were obtained through some calculation (`No` means they were not). It is normally given only for columns with a numerical dtype._
- _You can add as many metadata entries as you want, but they must be passed as a dict._


#### **Saving the New Schema (Metadata) into the Table**


In [17]:
table = table.cast(table_schema)  # Cast the PyArrow Table to the specified schema

**Visualize the Modifications**


In [18]:
table.schema  # Retrieve the schema of the PyArrow Table

Gender: string
  -- field metadata --
  Description: 'The passenger's Gender'
Age: string
  -- field metadata --
  Description: 'The passenger's Age'
  Calculation: 'No'
Siblings_on_Board: int8
  -- field metadata --
  Description: 'Number of sibilings that the passenger had on board'
  Calculation: 'No'
Parents_on_Board: int8
  -- field metadata --
  Description: 'Number of parents that the passenger had on board'
  Calculation: 'No'
Ticket_Price: double
  -- field metadata --
  Description: 'Ticket's Price'
  Calculation: 'No'
Port_of_Embarkation: string
  -- field metadata --
  Description: 'The port were the passenger embarked'
Class: string
  -- field metadata --
  Description: 'The passenger's class on the ship'
Adult/Child: string
  -- field metadata --
  Description: 'If the passenger is child or not'
Alone: bool
  -- field metadata --
  Description: 'If the passenger is travelling alone or not'
Survived: int64
  -- field metadata --
  Description: 'If the passenger survived or

### **Saving The File**


In [19]:
# Define the output file path and name
output_file = "updated_Titanic.parquet"
# Write the PyArrow Table to a Parquet file
pq.write_table(table, output_file)

### **Reading the Updated File**


In [21]:
# Read the updated Parquet file into a PyArrow Table
updated_pq_dataset = pq.read_table("updated_Titanic.parquet")
# Retrieve the schema of the updated PyArrow Table
updated_pq_dataset.schema

Gender: string
  -- field metadata --
  Description: 'The passenger's Gender'
Age: string
  -- field metadata --
  Description: 'The passenger's Age'
  Calculation: 'No'
Siblings_on_Board: int8
  -- field metadata --
  Description: 'Number of sibilings that the passenger had on board'
  Calculation: 'No'
Parents_on_Board: int8
  -- field metadata --
  Description: 'Number of parents that the passenger had on board'
  Calculation: 'No'
Ticket_Price: double
  -- field metadata --
  Description: 'Ticket's Price'
  Calculation: 'No'
Port_of_Embarkation: string
  -- field metadata --
  Description: 'The port were the passenger embarked'
Class: string
  -- field metadata --
  Description: 'The passenger's class on the ship'
Adult/Child: string
  -- field metadata --
  Description: 'If the passenger is child or not'
Alone: bool
  -- field metadata --
  Description: 'If the passenger is travelling alone or not'
Survived: int64
  -- field metadata --
  Description: 'If the passenger survived or

In [2]:
import polars as pl

updated_pq_dataset = pl.read_parquet_schema("updated_Titanic.parquet")
updated_pq_dataset

{'Gender': Utf8,
 'Age': Utf8,
 'Siblings_on_Board': Int8,
 'Parents_on_Board': Int8,
 'Ticket_Price': Float64,
 'Port_of_Embarkation': Utf8,
 'Class': Utf8,
 'Adult/Child': Utf8,
 'Alone': Boolean,
 'Survived': Int64}