## Interacting with files

In [1]:
from pubweb import PubWeb

client = PubWeb()

In [2]:
project = client.project.find_by_name('Test Project')
datasets = client.dataset.find_by_project(project_id=project.id, name='Test of mageck-count')
dataset = datasets[0]

files = client.dataset.get_dataset_files(project_id=project.id,
                                         dataset_id=dataset.id)

In [7]:
from pubweb.file_utils import filter_files_by_pattern

counts_file = filter_files_by_pattern(files, '**/counts.txt')[0]
# You can also filter manually
# counts_file = next((f for f in files if f.name == 'counts.txt'))
print(counts_file)

File(path=data/mageck/count/combined/counts.txt)
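
The manual filtering mentioned in the comment above can also be done with glob-style patterns from the standard library. The following is a minimal sketch, not PubWeb's own implementation: `SimpleFile` and `match_by_pattern` are hypothetical names used only for illustration, the second path is made up, and `fnmatchcase` is assumed to be close enough to the pattern semantics for this case.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class SimpleFile:
    """Hypothetical stand-in for a dataset file; only the relative path is modeled."""
    path: str


def match_by_pattern(files, pattern):
    """Return the files whose path matches a glob-style pattern (case-sensitive)."""
    return [f for f in files if fnmatchcase(f.path, pattern)]


# The first path comes from the output above; the second is illustrative only.
example_files = [
    SimpleFile("data/mageck/count/combined/counts.txt"),
    SimpleFile("data/mageck/count/combined/summary.txt"),
]

matches = match_by_pattern(example_files, "**/counts.txt")

# Passing a default to next() avoids a StopIteration when nothing matches.
counts_match = next(iter(matches), None)
print(counts_match)
```

Giving `next` a default value is a small safeguard: indexing `[0]` on an empty result raises an `IndexError`, whereas this version returns `None` so the caller can handle a missing file explicitly.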


If you don't already have direct access to the file's storage location, use the file service to fetch the file contents as a string.

In [4]:
counts = client.file.get_file(counts_file)

From here, you can load it into a DataFrame by wrapping the string in `StringIO`:

In [5]:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(counts), sep='\t')
df.head()

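| Unnamed: 0 | sgRNA | Gene | MO_Brunello_gDNA_2 | MO_Brunello_1 | MO_Brunello_2 | MO_Brunello_gDNA_1 |
|---|---|---|---|---|---|---|
| 0 | A1BG_0 | A1BG | 0 | 0 | 0 | 0 |
| 1 | A1BG_1 | A1BG | 0 | 0 | 0 | 2 |
| 2 | A1BG_2 | A1BG | 0 | 0 | 0 | 0 |
| 3 | A1BG_3 | A1BG | 0 | 0 | 2 | 0 |
| 4 | A1CF_36946 | A1CF | 0 | 0 | 0 | 0 |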
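
To try the `StringIO` step without a PubWeb connection, the sketch below parses a small tab-separated string the same way. The text and its sample column names are made up for illustration; only `pd.read_csv` and `StringIO` are taken from the cell above.

```python
from io import StringIO

import pandas as pd

# Made-up, tab-separated text shaped loosely like a counts file (illustrative values).
text = (
    "sgRNA\tGene\tsample_a\tsample_b\n"
    "guide_1\tGENE1\t0\t2\n"
    "guide_2\tGENE1\t5\t0\n"
)

# read_csv accepts any file-like object, so an in-memory string works once wrapped.
example_df = pd.read_csv(StringIO(text), sep="\t")
print(example_df.head())
```

Because `StringIO` behaves like an open file, the same `read_csv` arguments used for a file on disk, such as `sep`, apply unchanged.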


If you already have IAM access to the file location, you can pass the file's absolute path to `read_csv` directly.

Note that this also requires the `s3fs` package to be installed.

In [6]:
df = pd.read_csv(counts_file.absolute_path, sep='\t')
df.head()

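| Unnamed: 0 | sgRNA | Gene | MO_Brunello_gDNA_2 | MO_Brunello_1 | MO_Brunello_2 | MO_Brunello_gDNA_1 |
|---|---|---|---|---|---|---|
| 0 | A1BG_0 | A1BG | 0 | 0 | 0 | 0 |
| 1 | A1BG_1 | A1BG | 0 | 0 | 0 | 2 |
| 2 | A1BG_2 | A1BG | 0 | 0 | 0 | 0 |
| 3 | A1BG_3 | A1BG | 0 | 0 | 2 | 0 |
| 4 | A1CF_36946 | A1CF | 0 | 0 | 0 | 0 |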
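
Since the direct read depends on `s3fs` being installed, it can help to check for the package before calling `read_csv`. This is a small sketch using only the standard library; `has_module` is a hypothetical helper name, and the install command in the message is one common option rather than a requirement of PubWeb.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found for import."""
    return find_spec(name) is not None


if not has_module("s3fs"):
    # Hypothetical guidance message; adapt it to your environment's package manager.
    print("s3fs is not installed; install it (for example, `pip install s3fs`) "
          "before passing the absolute path to read_csv.")
```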
