# Fasting Pandas - A guide into optimizing your analytical processing 

## Lesson 3 - Read And Write

---

We now dive into the realm of files. We want to show we are better than those pesky Excel and CSV pencil pushers. This is an important step to start giving an impression of actually knowing something.

Before going any further, I will outright tell you I'm not going to talk about databases. This is out of the scope for this tutorial, for now. Buy me some coffee and I'll think about it.

All right, so why did we look into parsing before file management? The main reason is that parsing has a direct impact on the efficiency and benefits of each file format, so let's start by generating a dataframe. We will now increase our dataset to 20 million rows. If this is too much for your computer to handle, feel free to downsize the sample.

I created a few functions to help with this endeavor. We will start with a quick comparison on memory optimization.

In [1]:
import config
import os
import fasting_pandas as fp
import pandas as pd
import numpy as np
from IPython.display import Markdown as md

In [2]:
def check_dataframe_size(dataframe: pd.DataFrame):
    """
    Returns the in-memory size of a DataFrame as a human-readable string.

    Parameters:
        dataframe: dataframe object to check the size of

    Returns:
        string with the size scaled to the largest fitting unit, e.g. '267.03 MB'
    """
    size = dataframe.memory_usage(deep=True).sum()

    prefixes = ['B', 'KB', 'MB', 'GB', 'TB']
    index = 0
    while size >= 1024 and index < len(prefixes) - 1:
        size /= 1024
        index += 1

    return f"{size:.2f} {prefixes[index]}"

In [3]:
# Generate new dataframe
bad_df = fp.generate_teamresult_df(20_000_000)
print(f'dataframe memory usage is: {check_dataframe_size(bad_df)}')
# Optimize data types
df = fp.set_dtypes_for_teamresult_df(bad_df)
print(f'optimized dataframe memory usage is: {df.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB')

dataframe memory usage is: 3.81 GB
optimized dataframe memory usage is: 267.03 MB


In [4]:
md( 
    f"<p style = 'font-size: 15px;'>Ok, so parsing reduced our memory footprint from {check_dataframe_size(bad_df)} to {check_dataframe_size(df)} which is a {abs((df.memory_usage(deep=True).sum() / bad_df.memory_usage(deep=True).sum())-1)*100:.2f}% decrease in memory.</p>"
    "<p style = 'font-size: 15px;'>The highest benefits were returned by changing the object to category columns. We also did something clever by changing the logic on how to view results. Instead of having a string column that categorized the results by either win or lose, we created a column called win that mapped each win as True and lose as False. Our function failed to calculate the performance improvement because of the renaming of the column, but we can calculate it ourselves.</p>"
)

Ok, so parsing reduced our memory footprint from 3.81 GB to 267.03 MB which is a 93.16% decrease in memory.

The highest benefits were returned by changing the object to category columns. We also did something clever by changing the logic on how to view results. Instead of having a string column that categorized the results by either win or lose, we created a column called win that mapped each win as True and lose as False. Our function failed to calculate the performance improvement because of the renaming of the column, but we can calculate it ourselves.

In [5]:
before, after = bad_df.result.memory_usage(deep=True), df.win.memory_usage(deep=True)
print("{:.2f}".format((after - before) / before * 100) + '%')

-98.35%


Another 98%, which is nice. Let's move on to saving our work. First, we will release memory by deleting the **bad_df** dataframe, since it's consuming lots of precious memory and we no longer need it.

In [6]:
# Delete the name directly; mutating locals() is not a reliable way to remove variables.
try:
    del bad_df
except NameError as e:
    print(f'NameError: {e}')

## CSV

CSVs are at the forefront of data analytics. Chances are you have encountered one of these bad boys in the wild, and bad boys they are indeed. It's soul-crushing having to deal with them, so why are they so popular in the first place?

The simple truth is people commonly use Excel and CSVs because they are easy to understand. CSV stands for "comma-separated values," and as the name suggests, a CSV file is a plain text file where each row represents a record, and each column is separated by a comma.

They can be opened in any text editor or spreadsheet program, making them accessible to almost anyone. Since they are just plain text files, they're platform-independent and can be easily exchanged between different systems.

Thus, they have spread all across the globe like wildfire. People are used to exporting their data as a CSV, so odds are that's what you will have to deal with most of the time.

However, their simplicity is also their downfall. They have some serious drawbacks that make them less than ideal for data analytics:

1. Lack of standardization: While CSV files are widely used, there is no strictly enforced specification for how they should be formatted. RFC 4180 describes a common format, but it is only informational, and tools differ on delimiters, quoting, encodings, and line endings. This can lead to issues when importing or exporting CSV files between different systems.

2. Limited data types: CSV files can only store data as text, which means that numeric, boolean, or date/time values are written out as strings and their type information is lost. The reader has to guess or re-parse the types on load, which can lead to errors or loss of precision.

3. Large file sizes: CSV files can become very large very quickly, especially when dealing with large datasets with many columns. This can lead to slow loading times and increased storage requirements.

4. Lack of data validation: CSV files do not provide a way to enforce data validation rules, which means that the data in the file may contain errors or inconsistencies.
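
To make point 2 concrete, here is a minimal sketch using only Python's built-in `csv` module. The sample rows are made up for illustration. Whatever types go in, everything comes back out as a plain string.

```python
import csv
import io
from datetime import date

# Hypothetical sample rows with an int, a date, and a bool.
rows = [
    {"team": "red", "age": 31, "date": date(2023, 1, 15), "win": True},
    {"team": "blue", "age": 27, "date": date(2023, 1, 16), "win": False},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["team", "age", "date", "win"])
writer.writeheader()
writer.writerows(rows)

# Read them back: the CSV carries no type information at all.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
first = loaded[0]

print(first)
print({key: type(value).__name__ for key, value in first.items()})
```

This is the same thing that happens to our dataframe further down: the types we carefully set have to be rebuilt after every load.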

So let's look at some performance metrics, shall we? For this and all subsequent exercises we will measure the time to read and write, the file size, and the memory usage.

I created a few functions to help with this endeavor.

In [7]:
def get_file_size(file_name: str):
    """
    Returns the size of a file on disk as a human-readable string.

    Parameters:
        file_name: string with the path of the file to check the size of

    Returns:
        string with the size scaled to the largest fitting unit, e.g. '749.68 MB'
    """
    if not os.path.isfile(file_name):
        raise ValueError(f"'{file_name}' is not a file")

    size = os.path.getsize(file_name)

    prefixes = ['B', 'KB', 'MB', 'GB', 'TB']
    index = 0
    while size >= 1024 and index < len(prefixes) - 1:
        size /= 1024
        index += 1

    return f"{size:.2f} {prefixes[index]}"

In [8]:
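# The saving and loading functions are wrapped with the timeit decorator from the fasting_pandas library to measure time.
# Judging by how the results are used below, the decorated function returns a tuple:
# the first item is the original return value and the last item is the elapsed time in seconds.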
from fasting_pandas import timeit

@timeit
def save_dataframe(df: pd.DataFrame, file_name: str, file_format: str, directory: str = config.DATA_DIR, include_index: bool = False):     
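    """
    Saves a pandas DataFrame to a file in the specified format in the specified directory.

    Parameters:
    - df: pandas DataFrame to save
    - file_name: string with the name of the file, without extension
    - file_format: string representing the file format to save the DataFrame in ('csv', 'json', 'parquet', or 'pickle')
    - directory: string representing the directory to save the file in
    - include_index: whether to write the DataFrame index (used by the 'csv' and 'parquet' writers)
    """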
    save_functions = {
        'csv': lambda file_path: df.to_csv(file_path, index=include_index),
        'json': lambda file_path: df.to_json(file_path),
        'parquet': lambda file_path: df.to_parquet(file_path, index=include_index),
        'pickle': lambda file_path: df.to_pickle(file_path)
    }

    if file_format not in save_functions:
        raise ValueError(f"Unsupported file format: {file_format}")

    file_path = os.path.join(directory, f"{file_name}.{file_format}")
    with open(file_path, 'wb') as f:
        save_function = save_functions[file_format]
        save_function(f)

    print(f"DataFrame saved to '{file_path}' in '{file_format}' format.")

@timeit
def load_dataframe(file_name: str, file_format: str, directory = config.DATA_DIR):
    """
    Loads a pandas DataFrame from a file in the specified format in the specified directory.

    Parameters:
    - file_name: string with the name of the file, without extension
    - file_format: string representing the file format to load the DataFrame from ('csv', 'json', 'parquet', or 'pickle')
    - directory: string representing the directory to load the file from

    Returns:
    - pandas DataFrame loaded from the file in the specified format in the specified directory
    """
    load_functions = {
        'csv': pd.read_csv,
        'json': lambda file_path: pd.read_json(file_path),
        'parquet': pd.read_parquet,
        'pickle': pd.read_pickle
    }

    if file_format not in load_functions:
        raise ValueError(f"Unsupported file format: {file_format}")

    file_path = os.path.join(directory, f"{file_name}.{file_format}")
    load_function = load_functions[file_format]
    df = load_function(file_path)

    print(f"DataFrame loaded from '{file_path}' in '{file_format}' format.")

    return df
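
The `timeit` decorator lives in the `fasting_pandas` library and its source isn't shown here. Judging from the printed messages and from how the return values are indexed later on (`[0]` for the result, `[-1]` for the time), it probably looks something like the sketch below. Treat this as my assumption, not the actual implementation.

```python
import functools
import time


def timeit(func):
    """Hypothetical stand-in for fasting_pandas.timeit (the real one may differ)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Function '{func.__name__}' took {elapsed:.5f} seconds to execute.")
        # Return both the original result and the elapsed time in seconds.
        return result, elapsed

    return wrapper


@timeit
def add(a, b):
    return a + b


outcome = add(2, 3)
print(outcome[0])   # the value returned by add
print(outcome[-1])  # the elapsed time in seconds
```

Returning a tuple is why the cells below use `df_test[0]` to get the DataFrame and `df_test[-1]` to get the timing.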

In [9]:
format = 'csv'
file = 'dataset'
print(f'\nSaving {file} as {format} format...')
time_to_save = save_dataframe(df, file, format)
print('-------------------------------------')
print(f'\nRetrieving {format} file size...')
file_size = get_file_size(os.path.join(config.DATA_DIR, f'{file}.{format}'))
print(file_size)
print('-------------------------------------')
print(f'\nLoading {file}.{format} into a DataFrame...')
df_test = load_dataframe(file, format)
print('-------------------------------------')
print(f'\nGetting DataFrame data types...')
print(df_test[0].dtypes)
print('-------------------------------------')
print(f'\nGetting DataFrame memory usage...')
dataframe_size = check_dataframe_size(df_test[0])
print(dataframe_size)


Saving dataset as csv format...
DataFrame saved to 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.csv' in 'csv' format.
Function 'save_dataframe' took 50.26406 seconds to execute.
-------------------------------------

Retrieving csv file size...
749.68 MB
-------------------------------------

Loading dataset.csv into a DataFrame...
DataFrame loaded from 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.csv' in 'csv' format.
Function 'load_dataframe' took 8.77889 seconds to execute.
-------------------------------------

Getting DataFrame data types...
size     object
age       int64
team     object
date     object
prob    float64
win        bool
dtype: object
-------------------------------------

Getting DataFrame memory usage...
3.88 GB


### Results

Well well well, so lots of interesting things going on with the CSVs.

First interesting fact is they are trash. I'll summarize the metrics.

In [10]:
md(
    f"- Time to save: {time_to_save[-1]:.2f} seconds\n"
    f"- Time to read: {df_test[-1]:.2f} seconds\n"
    f"- Physical memory (file size on disk): {file_size}\n"
    f"- Virtual memory (DataFrame in RAM): {dataframe_size}\n"
    f"- Data Type respected: {(df.dtypes == df_test[0].dtypes).all()}"
)

- Time to save: 50.26 seconds
- Time to read: 8.78 seconds
- Physical memory (file size on disk): 749.68 MB
- Virtual memory (DataFrame in RAM): 3.88 GB
- Data Type respected: False

That is grotesque to say the least. Imagine that you had to share that file with a coworker or upload it into an application. What about just having to do some basic data wrangling? Will we have to merge this file with another one? We will even have to parse the data types again from scratch! I hope you can see my point.

If you are serious about data analytics then it's time to learn about other file formats. 

Lesson: avoid CSVs like the plague unless you don't have a choice. Thankfully, this is the last time we will use them for this tutorial. Ok thank you byeeeee

In [11]:
try:
    os.remove(os.path.join(config.DATA_DIR, f'{file}.{format}'))
except FileNotFoundError as e:
    print(e)

## JSON

All right slick, let's say you had already tried to move on from CSV files and knew about JSON. It should be better since it's the internet's favorite file format, right? Well... not quite.

JSON (JavaScript Object Notation) has become a popular file format for data analytics and web APIs due to its simplicity, flexibility, and widespread support among programming languages and frameworks. Unlike traditional data formats like CSV, which are limited to simple tabular data, JSON can handle complex, nested data structures with ease.

One of the main reasons why JSON is popular for web APIs is because it is a text-based format that can be easily parsed by web browsers and other HTTP clients. This makes it a convenient choice for transmitting data over the internet. Additionally, JSON has a standardized format that makes it easy to integrate with different programming languages and platforms.

But...

There are some drawbacks to using JSON as a file format for data analytics. One of the main issues is that JSON files tend to be larger than traditional binary formats, which can make them slower to load and process. Additionally, JSON does not have a standardized schema, which means that the structure of the data may vary from one file to another. This can make it difficult to work with large datasets that require a consistent schema.

Another issue with using JSON for data analytics is that it can be difficult to query and analyze the data. Because JSON is a hierarchical format, it can be challenging to extract specific fields or perform complex queries. This can make it harder to gain insights from the data and may require additional processing steps.

It may not sound that bad after all, and you might even find JSON easier to cope with than CSVs. But let's see it in action.

In [12]:
# Disclaimer: you can skip this step if you want, or reduce the size of the DataFrame. It takes A LOT of time to load, between 5 and 6 minutes on my computer.
format = 'json'
file = 'dataset'
print(f'\nSaving {file} as {format} format...')
time_to_save = save_dataframe(df, file, format)
print('-------------------------------------')
print(f'\nRetrieving {format} file size...')
file_size = get_file_size(os.path.join(config.DATA_DIR, f'{file}.{format}'))
print(file_size)
print('-------------------------------------')
print(f'\nLoading {file}.{format} into a DataFrame...')
df_test = load_dataframe(file, format)
print('-------------------------------------')
print(f'\nGetting DataFrame data types...')
print(df_test[0].dtypes)
print('-------------------------------------')
print(f'\nGetting DataFrame memory usage...')
dataframe_size = check_dataframe_size(df_test[0])
print(dataframe_size)


Saving dataset as json format...
DataFrame saved to 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.json' in 'json' format.
Function 'save_dataframe' took 25.93704 seconds to execute.
-------------------------------------

Retrieving json file size...
2.12 GB
-------------------------------------

Loading dataset.json into a DataFrame...
DataFrame loaded from 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.json' in 'json' format.
Function 'load_dataframe' took 244.76230 seconds to execute.
-------------------------------------

Getting DataFrame data types...
size            object
age              int64
team            object
date    datetime64[ns]
prob           float64
win               bool
dtype: object
-------------------------------------

Getting DataFrame memory usage...
2.93 GB


### Results

In [13]:
md(
    f"- Time to save: {time_to_save[-1]:.2f} seconds\n"
    f"- Time to read: {df_test[-1]:.2f} seconds\n"
    f"- Physical memory (file size on disk): {file_size}\n"
    f"- Virtual memory (DataFrame in RAM): {dataframe_size}\n"
    f"- Data Type respected: {(df.dtypes == df_test[0].dtypes).all()}"
)

- Time to save: 25.94 seconds
- Time to read: 244.76 seconds
- Physical memory (file size on disk): 2.12 GB
- Virtual memory (DataFrame in RAM): 2.93 GB
- Data Type respected: False

Our physical memory and time to read shot through the roof. The JSON file was saved without compression, which ended with atrocious results. 

Fortunately, there are several techniques for reducing JSON file sizes. One of the most common is gzip, a widely used lossless compression format that can significantly shrink text-heavy files like JSON. The file is compressed when it is written and decompressed when it is read, which trades some CPU time for less disk space and less data to send over a network.

Another approach for compressing JSON file sizes is to use a format such as BSON (Binary JSON), which is a binary representation of JSON that can be more compact and efficient for certain use cases.

pandas supports this directly: both `DataFrame.to_json()` and `pd.read_json()` accept a `compression` parameter (for example `compression='gzip'`). I won't go over it in this tutorial, since the point is moot. We can't avoid the parsing issue in any case, so I want to show you some better options.
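
If you are curious what gzip buys you, here is a rough sketch with the standard library only. The records are made up and extremely repetitive, so the ratio you get on real data will be less flattering.

```python
import gzip
import json

# Hypothetical, highly repetitive records (real data compresses less well).
records = [{"team": "red", "age": 30, "win": True} for _ in range(10_000)]

# Serialize to JSON text, then compress the resulting bytes with gzip.
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

# Decompress and parse again to confirm nothing was lost.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
print(f"round trip intact: {restored == records}")
```

Note that compression only helps with size. The data still comes back as generic JSON types, so the dtype problem stays exactly where it was.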

In [14]:
try:
    os.remove(os.path.join(config.DATA_DIR, f'{file}.{format}'))
except FileNotFoundError as e:
    print(e)

## Pickle

The Python "pickle" module provides a powerful way to serialize and de-serialize Python objects. Pickle is a binary format that can store complex data structures, including nested lists, dictionaries, and custom objects. CSV is limited to flat tables of text, and JSON only knows a handful of basic types (strings, numbers, booleans, null, lists, and objects), whereas Pickle can serialize almost any Python object with its exact type intact. That makes it a convenient choice for scientific and machine learning applications that deal with large and complex datasets.

Another advantage of Pickle is its speed. JSON and CSV have to be parsed from text and formatted back into text every time they are read or written. Pickle stores the data in a binary format, so it can be read from and written to disk quickly. This makes it a good fit for applications that need fast I/O.

Pickle also has a versioned protocol. Newer Python releases can read pickles written with older protocols, so files generally remain loadable as you upgrade Python. The reverse is not guaranteed, and unpickling still depends on the classes involved (including the libraries that defined them) being compatible on the reading side.

In addition, pandas can compress pickle files on the fly through the `compression` parameter of `to_pickle` and `read_pickle`. The pickle module itself does not compress anything, but this option reduces file size at the cost of some speed, which can be useful when dealing with large datasets or moving files between systems.

Overall, while JSON and CSV are fine for simple data, Pickle is a stronger option for complex and varied Python data types. Its support for arbitrary objects, exact data types, and fast I/O makes it a flexible choice for storing data in Python.
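
Here is a small sketch of what "keeps the types" means in practice, using only the standard library. The dictionary is a made-up example. Pickle hands back the very same types, while JSON either refuses the object or quietly changes it.

```python
import json
import pickle
from datetime import datetime

# Hypothetical object mixing types that JSON has no native representation for.
original = {
    "played_at": datetime(2023, 1, 15, 18, 30),
    "teams": {"red", "blue"},
    "score": (3, 1),
}

# Pickle round trip: values and types survive unchanged.
restored = pickle.loads(pickle.dumps(original))
print({key: type(value).__name__ for key, value in restored.items()})

# JSON cannot serialize datetime or set objects at all.
try:
    json.dumps(original)
    json_error = None
except TypeError as error:
    json_error = str(error)
print(f"json.dumps failed: {json_error}")

# Even when JSON accepts a value, the type may change: a tuple becomes a list.
score_via_json = json.loads(json.dumps({"score": (3, 1)}))["score"]
print(type(score_via_json).__name__)
```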

Let's look at the numbers.

In [15]:
format = 'pickle'
file = 'dataset'
print(f'\nSaving {file} as {format} format...')
time_to_save = save_dataframe(df, file, format)
print('-------------------------------------')
print(f'\nRetrieving {format} file size...')
file_size = get_file_size(os.path.join(config.DATA_DIR, f'{file}.{format}'))
print(file_size)
print('-------------------------------------')
print(f'\nLoading {file}.{format} into a DataFrame...')
df_test = load_dataframe(file, format)
print('-------------------------------------')
print(f'\nGetting DataFrame data types...')
print(df_test[0].dtypes)
print('-------------------------------------')
print(f'\nGetting DataFrame memory usage...')
dataframe_size = check_dataframe_size(df_test[0])
print(dataframe_size)


Saving dataset as pickle format...
DataFrame saved to 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.pickle' in 'pickle' format.
Function 'save_dataframe' took 2.40936 seconds to execute.
-------------------------------------

Retrieving pickle file size...
267.03 MB
-------------------------------------

Loading dataset.pickle into a DataFrame...
DataFrame loaded from 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.pickle' in 'pickle' format.
Function 'load_dataframe' took 0.07739 seconds to execute.
-------------------------------------

Getting DataFrame data types...
size          category
age               int8
team          category
date    datetime64[ns]
prob           float16
win               bool
dtype: object
-------------------------------------

Getting DataFrame memory usage...
267.03 MB


### Results

In [16]:
md(
    f"- Time to save: {time_to_save[-1]:.2f} seconds\n"
    f"- Time to read: {df_test[-1]:.2f} seconds\n"
    f"- Physical memory (file size on disk): {file_size}\n"
    f"- Virtual memory (DataFrame in RAM): {dataframe_size}\n"
    f"- Data Type respected: {(df.dtypes == df_test[0].dtypes).all()}"
)

- Time to save: 2.41 seconds
- Time to read: 0.08 seconds
- Physical memory (file size on disk): 267.03 MB
- Virtual memory (DataFrame in RAM): 267.03 MB
- Data Type respected: True

Isn't this a real improvement? Insane read and write speeds combined with proper data parsing. So what's the catch?

The pickle format is specific to Python and may not be compatible with other programming languages or platforms. This can be a significant limitation in some use cases where interoperability with other systems is necessary.

In addition, there are some potential drawbacks and security risks associated with the pickle format. For example, unpickling untrusted or malicious pickle data can lead to security vulnerabilities such as arbitrary code execution. This can be a serious concern in applications where the pickle data may come from untrusted sources.
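
To see why this matters, here is a deliberately harmless sketch. The `Sneaky` class is made up for illustration. A pickle can tell the loader to call any importable function with any arguments, and the loader will simply do it. Here the function is just `int`, but an attacker would pick something far less friendly.

```python
import pickle


class Sneaky:
    """Hypothetical class that hijacks how it is rebuilt on load."""

    def __reduce__(self):
        # Instruct pickle to "rebuild" this object by calling int("42").
        # Any importable callable could be placed here instead.
        return (int, ("42",))


payload = pickle.dumps(Sneaky())

# Loading the payload runs the callable chosen by whoever created the pickle.
result = pickle.loads(payload)

print(type(result).__name__, result)
```

We pickled a `Sneaky` object and got back whatever the payload's author decided to call. That is the whole problem in a nutshell, so only load pickle files that you created yourself or that come from a source you trust.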

Furthermore, not everything can be pickled. Objects such as lambdas, open file handles, sockets, and database connections cannot be serialized, and custom classes can only be loaded if their definition is importable where the pickle is read. In these cases, it may be necessary to use a different serialization format that is better suited to the data being processed.

Overall, while the pickle format can be a powerful and convenient tool for data serialization and persistence in Python, it is important to be aware of its limitations and potential risks. You should carefully consider your use case and the security implications before deciding whether to use pickle in your application.

In [17]:
try:
    os.remove(os.path.join(config.DATA_DIR, f'{file}.{format}'))
except FileNotFoundError as e:
    print(e)

## Parquet

I saved the best for last.

Parquet is a columnar storage format that is designed for efficient processing and storage of large datasets. Parquet stores data in a compressed and serialized format, which reduces the size of the data and allows for faster and more efficient processing. In addition, Parquet provides advanced features such as predicate pushdown, which allows for faster filtering of data, and schema evolution, which allows for easy updates to the data schema without requiring a full rebuild.

While Pickle is flexible and easy to use, it can suffer from performance and scalability issues when dealing with large datasets. Pickle does not support parallel processing or distributed computing, which can limit its scalability in large-scale applications.

On the other hand, Parquet is designed for large-scale data processing and is optimized for parallel processing and distributed computing. Parquet can be used with tools such as Apache Spark and Apache Hadoop to process and analyze large datasets in a distributed environment. In addition, Parquet provides built-in support for compression, encoding, and serialization, which can further improve its performance and scalability.
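
Parquet's encodings include dictionary encoding and run-length encoding, which is a big part of why repetitive columns shrink so much. The toy functions below are my own simplified illustration of the idea on a single made-up column, not how any Parquet library actually implements it.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []
    positions = {}
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, count) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


# Hypothetical 'team' column stored contiguously, as a columnar format would.
team_column = ["red", "red", "blue", "blue", "blue", "red"]

dictionary, codes = dictionary_encode(team_column)
runs = run_length_encode(codes)

print(dictionary)  # ['red', 'blue']
print(codes)       # [0, 0, 1, 1, 1, 0]
print(runs)        # [(0, 2), (1, 3), (0, 1)]
```

Because all values of a column sit next to each other, tricks like these work much better than they would on a row-by-row layout where strings, numbers, and dates are interleaved.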

Overall, while Pickle is a flexible and easy-to-use format for storing and exchanging data in Python, it is not well-suited for large-scale data processing and can suffer from performance and scalability issues. Parquet, on the other hand, is optimized for efficient processing and storage of large datasets and provides advanced features such as predicate pushdown and schema evolution, making it an ideal choice for big data applications that require high performance and scalability.

Before I go on, you should know something about Parquet. While Parquet supports many data types, including numeric, string, and date/time types, at the time of writing a float16 column could not be saved to Parquet from pandas.

The main reason for this is that float16 is a relatively rare data type that is not widely used in most data processing applications. In addition, float16 has limited precision compared to other numeric types, which can lead to accuracy issues when performing complex calculations or analysis.

No biggie, we will change the data type to float32.

In [18]:
df.prob = df.prob.astype('float32')
format = 'parquet'
file = 'dataset'
print(f'\nSaving {file} as {format} format...')
time_to_save = save_dataframe(df, file, format)
print('-------------------------------------')
print(f'\nRetrieving {format} file size...')
file_size = get_file_size(os.path.join(config.DATA_DIR, f'{file}.{format}'))
print(file_size)
print('-------------------------------------')
print(f'\nLoading {file}.{format} into a DataFrame...')
df_test = load_dataframe(file, format)
print('-------------------------------------')
print(f'\nGetting DataFrame data types...')
print(df_test[0].dtypes)
print('-------------------------------------')
print(f'\nGetting DataFrame memory usage...')
dataframe_size = check_dataframe_size(df_test[0])
print(dataframe_size)


Saving dataset as parquet format...
DataFrame saved to 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.parquet' in 'parquet' format.
Function 'save_dataframe' took 2.59641 seconds to execute.
-------------------------------------

Retrieving parquet file size...
90.87 MB
-------------------------------------

Loading dataset.parquet into a DataFrame...
DataFrame loaded from 'd:\Proyectos\datakai\projects\github\datakaicr\public\fasting-pandas\data\dataset.parquet' in 'parquet' format.
Function 'load_dataframe' took 0.55483 seconds to execute.
-------------------------------------

Getting DataFrame data types...
size          category
age               int8
team          category
date    datetime64[ns]
prob           float32
win               bool
dtype: object
-------------------------------------

Getting DataFrame memory usage...
305.18 MB


### Results

In [19]:
md(
    f"- Time to save: {time_to_save[-1]:.2f} seconds\n"
    f"- Time to read: {df_test[-1]:.2f} seconds\n"
    f"- Physical memory (file size on disk): {file_size}\n"
    f"- Virtual memory (DataFrame in RAM): {dataframe_size}\n"
    f"- Data Type respected: {(df.dtypes == df_test[0].dtypes).all()}"
)

- Time to save: 2.60 seconds
- Time to read: 0.55 seconds
- Physical memory (file size on disk): 90.87 MB
- Virtual memory (DataFrame in RAM): 305.18 MB
- Data Type respected: True

We sacrificed a tiny bit of I/O speed compared to Pickle, but gained a huge improvement in file size: 90.87 MB on disk versus 267.03 MB. The in-memory size went up to 305.18 MB only because we widened the prob column to float32, not because of Parquet itself. In a columnar format each column holds values of a single type that compress well together, which lets Parquet minimize the storage space required for large datasets. This is particularly important for big data processing applications, where storage costs can be a significant factor.
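
If you are wondering where the extra in-memory size comes from, a quick back-of-the-envelope check suggests it is the float16 to float32 change rather than Parquet itself. The only assumption is that each of the 20 million `prob` values grows from 2 bytes to 4 bytes.

```python
rows = 20_000_000

# float16 uses 2 bytes per value, float32 uses 4 bytes per value.
extra_bytes_per_row = 4 - 2
extra_mb = rows * extra_bytes_per_row / 1024 ** 2

# In-memory size reported earlier for the DataFrame with float16.
float16_frame_mb = 267.03
float32_frame_mb = float16_frame_mb + extra_mb

print(f"{extra_mb:.2f} MB extra, {float32_frame_mb:.2f} MB total")
# 38.15 MB extra, 305.18 MB total
```

That lines up with the 305.18 MB reported for the DataFrame loaded from Parquet.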

In [20]:
try:
    os.remove(os.path.join(config.DATA_DIR, f'{file}.{format}'))
except FileNotFoundError as e:
    print(e)

## Conclusion

By comparing CSV, JSON, pickle, and Parquet files, it is clear that each format has its own strengths and weaknesses when it comes to data processing and analytics. CSV and JSON are both popular file formats that are widely used for data exchange, but they are not as efficient or scalable as Parquet for large-scale analytics. Pickle is a powerful serialization format that is specific to Python, but it can be risky and may not be compatible with other platforms.

When it comes to data wrangling, engineering, and analytics, parsing your data types correctly and choosing how you store your dataframes both matter. It is critical to choose a data format that is well-suited to your specific use case, taking into account factors such as data type, storage efficiency, query performance, and compatibility with other tools and platforms.

In conclusion, while there is no single "best" data format for all data processing applications, Parquet is an excellent choice for many data analytics use cases. Its efficient storage, scalability, and compatibility with a wide range of tools and platforms make it a powerful tool for big data processing and analytics.

This ends the tutorial for now. I truly hope that you found these lessons helpful and that you gained some valuable tips and tricks that you can start using right away. If you found it beneficial, please let me know by leaving a comment. If you didn't find it helpful, please share your thoughts in a more detailed comment. Your feedback is greatly appreciated and helps me improve. Thank you!