In this file you can write whatever you see fit. We recommend detailing your solution and all the assumptions you are making. Here you can run the functions you defined in the other files of the src folder, measure time, memory, etc.

This is a rough way to measure the memory consumed by a script. It is rough because it does not profile the script as it runs; it samples the resident memory of the launched process once, one second after start, so the result is only an approximation.

In [18]:
import psutil
import subprocess
import time

t_process = subprocess.Popen(["python", "/Users/mema/Documents/Projects/latam_challenge/src/q2/time.py"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
m_process = subprocess.Popen(["python", "/Users/mema/Documents/Projects/latam_challenge/src/q2/memory.py"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
time.sleep(1)

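try:
    # Sample the resident set size (RSS) of each child process once.
    t_memory = psutil.Process(t_process.pid).memory_info()
    m_memory = psutil.Process(m_process.pid).memory_info()
    print(f"Memory used by q2/time.py: {t_memory.rss/1024} Kilobytes")
    print(f"Memory used by q2/memory.py: {m_memory.rss/1024} Kilobytes")
except psutil.NoSuchProcess as e:
    # psutil.ZombieProcess is a subclass of NoSuchProcess, so a child that
    # has already exited but was not yet reaped also lands here.
    print(f"The process has already finished. {e}")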



The process has already finished. PID still exists but it's a zombie (pid=88440)
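
The zombie message above suggests that the script had already exited before the one-second sample was taken. One possible alternative, sketched here under the assumption of a Unix-like system, is to let the child run to completion and then ask the operating system for its peak resident memory through the standard-library resource module. The helper name peak_child_rss_kib and the inline command are hypothetical and only serve the illustration; they are not part of the project.

```python
import resource
import subprocess
import sys


def peak_child_rss_kib(cmd):
    """Run cmd to completion and return the peak RSS of child processes in KiB.

    Caveat: RUSAGE_CHILDREN reports the largest peak among ALL children that
    this process has waited for so far, not only the one started here.
    """
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is expressed in bytes on macOS and in kilobytes on Linux.
    return peak / 1024 if sys.platform == "darwin" else peak


# Hypothetical stand-in for a script such as src/q2/memory.py.
cmd = [sys.executable, "-c", "data = bytearray(50_000_000)"]
peak_kib = peak_child_rss_kib(cmd)
print(f"Peak RSS of child processes: {peak_kib:.0f} KiB")
```

Because the value is read after the child has exited, there is no race between sampling and process termination. The resource module is not available on Windows.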


Another way to measure memory, and the one used for each of the functions in this project, is memory_usage from memory-profiler.

NOTE:
1. It will not work when run in a notebook, because '__main__' cannot be found and an error is raised.
2. partial creates a version of a function with some arguments already bound, which is useful when the same function must be reused with common arguments. memory_usage is then used to measure the memory usage of that partial function.

In [None]:
from memory_profiler import memory_usage
from functools import partial

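def add(n1: int, n2: int) -> int:
    # Named add rather than sum to avoid shadowing the built-in sum().
    return n1 + n2

mem_partial = partial(add, n1=2, n2=3)
# memory_usage samples the process while the callable runs and returns
# a list of readings in MiB.
mem_usage = memory_usage(mem_partial)

print(f"Memory used during the process: {max(mem_usage)} MiB")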
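
Since memory_usage does not run inside a notebook (see the note above), a possible fallback for quick experiments is the standard-library tracemalloc module. The sketch below is an assumption about how such a check could look, not the method used in the project: measure_peak and build_list are hypothetical names, and tracemalloc only tracks memory allocated by Python itself, so its numbers are not comparable to the process-level readings of memory_usage.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Call func and return (result, peak bytes allocated by Python during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_list(n):
    # Hypothetical workload standing in for one of the project functions.
    return list(range(n))


result, peak_bytes = measure_peak(build_list, 100_000)
print(f"Peak traced memory: {peak_bytes / 1024:.1f} KiB")
```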

Measuring execution time is straightforward. We read a timer just before the code runs and again just after it finishes; the difference between the two readings is the elapsed time in seconds. This is precise enough for runs that last seconds.

In [16]:
import time

def func():
    time.sleep(5)

start_processing_time = time.time()
func()
end_processing_time = time.time()

total_processing_time = end_processing_time - start_processing_time
print(f"Total time processing: {total_processing_time}, sec")

Total time processing: 5.00519323348999, sec
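
The same start/stop pattern can be wrapped in a decorator so that it does not have to be repeated around every call. This is a minimal sketch rather than the project's code: the names timed and work are hypothetical, and it uses time.perf_counter() instead of time.time() because perf_counter is a monotonic, high-resolution clock that is not affected by system clock adjustments.

```python
import time
from functools import wraps


def timed(func):
    """Wrap func so that each call returns (result, elapsed seconds)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed
    return wrapper


@timed
def work(n):
    # Hypothetical workload: sum of the integers 0 .. n-1.
    return sum(range(n))


total, elapsed = work(1_000_000)
print(f"Result: {total}, total time processing: {elapsed:.4f} sec")
```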


Most of the exercises rely on Counter (from the standard-library collections module), Pandas and Dask. DataFrames from both Pandas and Dask are used extensively.

Counter:
Counter is a class in the collections module of the Python standard library for counting hashable objects. It provides a simple way to count the occurrences of elements in a sequence, such as a list or string.
Typical use: counting the frequency of elements in a list, string, or other iterable.
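
As a small illustration of the description above, the following sketch counts repeated strings and asks for the most frequent ones. The sample data is made up for the example.

```python
from collections import Counter

# Made-up sample data.
words = ["dask", "pandas", "dask", "counter", "dask", "pandas"]

counts = Counter(words)
# most_common(n) returns the n most frequent elements as (element, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('dask', 3), ('pandas', 2)]

# A string is also a sequence, so its characters can be counted directly.
letters = Counter("banana")
print(letters["a"])  # 3
```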

Pandas:
Pandas is a Python library for data manipulation and analysis. It provides powerful and flexible data structures, such as DataFrame, that allow you to work with tabular data and perform operations on it efficiently.
Typical use: loading, cleaning, transforming and analyzing tabular data sets from sources such as CSV, Excel or SQL.

Dask:
Dask is a Python library for parallel and distributed computing. It provides flexible data structures, such as Dask Array and Dask DataFrame, that extend the functionality of NumPy and Pandas to data sets too large to fit in the RAM of a single machine.
Typical use: working with data sets that Pandas cannot handle because of memory limitations, or parallelizing operations on large data sets to improve performance.

Dask DataFrame vs Pandas DataFrame:

Dask DataFrame: A distributed, parallelizable data structure that resembles a Pandas DataFrame. It handles data sets that do not fit in the RAM of a single machine by dividing them into smaller blocks and operating on those blocks in parallel. Ideal for processing large volumes of data.
Pandas DataFrame: An in-memory tabular data structure designed for data analysis. It is efficient for data sets that fit in the memory of a single machine, but becomes a bottleneck for very large ones. Ideal for exploratory data analysis and in-memory data manipulation.
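
The idea of dividing data into smaller blocks and combining the partial results can be illustrated without Dask. The sketch below is only an analogy written with the standard library, not Dask's API: iter_chunks and count_in_chunks are hypothetical helpers, the blocks are processed one after another rather than in parallel, and the sample data is made up.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def count_in_chunks(items: Iterable[str], size: int) -> Counter:
    """Count each block separately, then merge the partial counts."""
    total = Counter()
    for chunk in iter_chunks(items, size):
        partial_counts = Counter(chunk)  # only one block is held at a time
        total.update(partial_counts)     # merging adds the counts together
    return total


# Made-up sample data.
users = ["ana", "luis", "ana", "sol", "luis", "ana", "sol"]
chunked = count_in_chunks(users, size=3)
print(chunked.most_common(1))  # [('ana', 3)]
```

Only one block needs to be in memory at any moment, which is the property that lets a block-based approach scale past the size of RAM.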

The JSON to be processed is fetched from a Google Cloud Storage bucket. The function load_json_from_gcs() downloads the data and returns it as a list of dictionaries.

In [None]:
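from typing import List

import orjson
from google.cloud import storage

# credentials_path, bucket_name and blob_name are expected to be defined elsewhere.
def load_json_from_gcs() -> List[dict]:
    """
    Load JSON data from Google Cloud Storage.
    Returns:
        list: List of JSON objects loaded from the specified GCS blob.
    """
    print("Loading JSON from Google Cloud Storage")
    storage_client = storage.Client.from_service_account_json(credentials_path)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    try:
        # The blob holds one JSON object per line.
        content = blob.download_as_text()
        return [orjson.loads(line) for line in content.splitlines()]
    except Exception as e:
        print(f"Error processing the file: {e}")
        return []  # keep the declared List[dict] return type on failure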

Another option is to read the JSON from a local file. The load_json_from_local function was built for exhaustive testing, so that the bucket is not accessed more than necessary, since each access may incur costs.

In [None]:
import json
from typing import List

# json_file_local_path is expected to be defined elsewhere.
def load_json_from_local() -> List[dict]:
    """
    Load JSON data from local file.
    Returns:
        list: List of JSON objects loaded from the specified local file.
    """
    gcp_file = []
    with open(json_file_local_path, 'r') as file:
        # The file holds one JSON object per line.
        for line in file:
            gcp_file.append(json.loads(line))
    return gcp_file
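
When the local file is large, building the whole list first may not be necessary. A possible variation, sketched here as an assumption rather than as part of the project, is a generator that yields one record at a time and tolerates blank or malformed lines. The name iter_json_lines and the temporary sample file are hypothetical and exist only for the demonstration.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                print(f"Skipping malformed line {line_number}: {error}")


# Demonstration with a throwaway file containing made-up records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": 1}\n\n{"id": 2}\nnot json\n')
    sample_path = tmp.name

records = list(iter_json_lines(sample_path))
os.remove(sample_path)
print(records)  # [{'id': 1}, {'id': 2}]
```

Callers that only need a single pass, such as feeding a Counter, can iterate over the generator directly instead of wrapping it in list().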