In [1]:
import pandas as pd
import requests as r
from io import StringIO

## Storage access time assessment
If your analysis reads and writes a large number of files, where those files are stored can affect run time. In this notebook, we time how long it takes to read files into a program, and to write them back out, for several storage locations.
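Outside a notebook, where the `%%time` and `%timeit` magics are unavailable, a similar comparison can be sketched with `time.perf_counter`. The helper name `time_reads` is an illustrative assumption, and the sketch reads raw bytes rather than parsing with pandas, so it measures storage access without the parsing cost.

```python
import time
from pathlib import Path
from typing import Dict, Iterable, List


def time_reads(paths: Iterable[str], repeats: int = 3) -> Dict[str, List[float]]:
    """Return the wall-clock time (seconds) of each full read, per path."""
    results: Dict[str, List[float]] = {}
    for path in paths:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            Path(path).read_bytes()  # read the whole file into memory
            timings.append(time.perf_counter() - start)
        results[path] = timings
    return results


# Usage with paths that appear later in this notebook:
# time_reads(["/scratch.local/chr1.vcf", "/scratch.global/hpc4ag/chr1.vcf"])
```

Repeated reads of the same file may be served from the operating system's page cache, so the first timing in each list is often the most representative of a cold read.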
### Timings for File Inputs to a dataframe

#### From Panasas (parallel storage)

In [2]:
%%time
df = pd.read_csv("/home/gems_learning/shared/hpc4ag/3k-core-v7-chr1/chr1.vcf", sep="\t", skiprows=6)

CPU times: user 15.1 s, sys: 1.75 s, total: 16.8 s
Wall time: 17 s


In [3]:
%timeit df = pd.read_csv("/home/gems_learning/shared/hpc4ag/3k-core-v7-chr1/chr1.vcf", sep="\t", skiprows=6)

11.4 s ± 144 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### From Tier 2 (S3 storage)
##### Define a function to grab a dataframe from S3

In [4]:
def load_s3_csv(url: str, **kwargs) -> pd.DataFrame:
    """Utility to load S3 csvs into pandas DataFrames.

    Args:
        url (str): S3 url (https)
        **kwargs: passed through to pd.read_csv (e.g. sep, skiprows)

    Returns:
        pd.DataFrame: containing csv at provided url.
    """
    # using this to get around pandas ssl error when reading url directly
    res = r.get(url)
    # explicit check instead of assert, which is skipped when Python runs with -O
    if res.status_code != 200:
        raise RuntimeError(f'Failed to read {url} (HTTP {res.status_code})')
    csv_str = res.text
    df = pd.read_csv(StringIO(csv_str), **kwargs)
    return df
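
The wall time of a call to `load_s3_csv` combines the network download with pandas parsing. A minimal sketch for timing the two stages separately follows; `timed_load_csv` and its `fetch_text` parameter are hypothetical names introduced for illustration, not part of the original notebook.

```python
import time
from io import StringIO
from typing import Callable, Dict, Tuple

import pandas as pd


def timed_load_csv(fetch_text: Callable[[], str],
                   **kwargs) -> Tuple[pd.DataFrame, Dict[str, float]]:
    """Fetch CSV text, parse it, and report the seconds spent in each stage."""
    t0 = time.perf_counter()
    text = fetch_text()  # e.g. an HTTP download
    t1 = time.perf_counter()
    df = pd.read_csv(StringIO(text), **kwargs)  # parsing only
    t2 = time.perf_counter()
    return df, {"download_s": t1 - t0, "parse_s": t2 - t1}


# Usage with the S3 URL from this notebook (requires `import requests`):
# url = "https://s3.msi.umn.edu/hpc4ag/3k-core-v7-chr1/chr1.vcf"
# df, timings = timed_load_csv(lambda: requests.get(url).text, sep="\t", skiprows=6)
```

Splitting the measurement this way shows whether a slower S3 load comes from the transfer itself or from the parse.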


##### Use the Unix s3cmd command to make sure our desired file is on Tier 2

In [5]:
! s3cmd ls s3://hpc4ag; s3cmd ls s3://hpc4ag/3k-core-v7-chr1/

                    DIR  s3://hpc4ag/3k-core-v7-chr1/
                    DIR  s3://hpc4ag/csb/
2024-11-11 16:18   491M  s3://hpc4ag/3k-core-v7-chr1/chr1.vcf


##### Load the dataframe from this object store and see how long it takes!

In [6]:
%%time
df = load_s3_csv("https://s3.msi.umn.edu/hpc4ag/3k-core-v7-chr1/chr1.vcf", sep="\t", skiprows=6)

CPU times: user 13.2 s, sys: 4.55 s, total: 17.7 s
Wall time: 20.1 s


In [7]:
%timeit load_s3_csv("https://s3.msi.umn.edu/hpc4ag/3k-core-v7-chr1/chr1.vcf", sep="\t", skiprows=6)

19.7 s ± 2.89 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### From local scratch

In [8]:
%%time
! cp /home/gems_learning/shared/hpc4ag/3k-core-v7-chr1/chr1.vcf /scratch.local/

CPU times: user 6.2 ms, sys: 38.8 ms, total: 45 ms
Wall time: 949 ms


In [9]:
%%time
df = pd.read_csv("/scratch.local/chr1.vcf", sep="\t", skiprows=6)

CPU times: user 11.8 s, sys: 1.87 s, total: 13.7 s
Wall time: 13.8 s


#### From /scratch.global (VAST storage)?

In [10]:
%%time
! cp /home/gems_learning/shared/hpc4ag/3k-core-v7-chr1/chr1.vcf /scratch.global/hpc4ag/

CPU times: user 6.4 ms, sys: 60.5 ms, total: 66.9 ms
Wall time: 1.03 s


In [11]:
%%time
df = pd.read_csv("/scratch.global/hpc4ag/chr1.vcf", sep="\t", skiprows=6)

CPU times: user 11.9 s, sys: 1.91 s, total: 13.8 s
Wall time: 14.2 s


## Writing files
Does write time differ across these storage locations?
#### Panasas

In [12]:
%%time
df.to_csv("/home/gems_learning/shared/hpc4ag/test.vcf", index=False, sep="\t")

CPU times: user 22.7 s, sys: 1.41 s, total: 24.1 s
Wall time: 37.4 s


#### Tier 2

In [13]:
%%time
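# NOTE: StringIO(...) creates an in-memory text buffer seeded with the URL string,
# so this cell times serialization to memory only; nothing is uploaded to Tier 2.
df.to_csv(StringIO("https://s3.msi.umn.edu/kats/chr1.vcf"), sep="\t")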

CPU times: user 20.7 s, sys: 1.37 s, total: 22.1 s
Wall time: 22.2 s
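
Because the cell above writes into an in-memory `StringIO` buffer, its timing does not include any transfer to Tier 2. One way to time an actual upload, sketched under the assumption that `s3cmd` is configured as in the `s3cmd ls` cell, is to write the file locally and then run `s3cmd put`. The function name `timed_s3_put`, its `runner` parameter, and the destination URI in the usage comment are illustrative assumptions.

```python
import subprocess
import time
from typing import Callable, List


def timed_s3_put(local_path: str, s3_uri: str,
                 runner: Callable[..., object] = subprocess.run) -> float:
    """Upload local_path to s3_uri with `s3cmd put`; return wall time in seconds."""
    cmd: List[str] = ["s3cmd", "put", local_path, s3_uri]
    start = time.perf_counter()
    runner(cmd, check=True)  # with subprocess.run, check=True raises on failure
    return time.perf_counter() - start


# Usage (hypothetical destination object):
# df.to_csv("/scratch.local/test.vcf", index=False, sep="\t")
# timed_s3_put("/scratch.local/test.vcf", "s3://hpc4ag/test.vcf")
```

The total Tier 2 write cost is then the local `to_csv` time plus the upload time returned here.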


#### Local scratch

In [14]:
%%time
df.to_csv("/scratch.local/test.vcf", index=False, sep="\t")

CPU times: user 20.4 s, sys: 744 ms, total: 21.2 s
Wall time: 21.4 s


#### Global scratch

In [15]:
%%time
df.to_csv("/scratch.global/test.vcf", index=False, sep="\t")

CPU times: user 20.3 s, sys: 451 ms, total: 20.8 s
Wall time: 21.2 s
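
To compare locations at a glance, the `%%time` wall times reported in the cells above can be collected and expressed relative to local scratch. This is a small sketch over the single-run figures printed in this notebook, so the ratios are indicative only. The Tier 2 write is omitted because that cell wrote to an in-memory buffer rather than to the object store. The dictionary and function names are illustrative.

```python
# Wall times in seconds, copied from the %%time outputs in this notebook.
read_wall_s = {
    "Panasas": 17.0,
    "Tier 2 (S3)": 20.1,
    "local scratch": 13.8,
    "global scratch": 14.2,
}
write_wall_s = {
    "Panasas": 37.4,
    "local scratch": 21.4,
    "global scratch": 21.2,
}


def relative_to(times: dict, baseline: str) -> dict:
    """Express each wall time as a multiple of the baseline location's time."""
    base = times[baseline]
    return {name: round(t / base, 2) for name, t in times.items()}


for label, times in (("read", read_wall_s), ("write", write_wall_s)):
    print(label, relative_to(times, "local scratch"))
```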
