# How to read data from Cloud Storage

This notebook demonstrates how to load a CSV file from Google Cloud Storage with Python,
using the [gsutil](https://cloud.google.com/storage/docs/gsutil) command-line tool to fetch the data.

There are other good ways to do the same thing:

- [Google Cloud Python client library](https://github.com/googleapis/google-cloud-python/tree/master/storage)
- [TensorFlow gfile module](https://www.tensorflow.org/api_docs/python/tf/io/gfile)

`gsutil` is used here because it is a common tool that is useful for working with
data in Cloud Storage in many contexts.

## Setup

In [None]:
from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd

# Enable IPython to display matplotlib graphs.
%matplotlib inline

In [None]:
SAMPLE_INFO_CSV = (
    "gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv"
)

## Retrieve a CSV

In [None]:
# Use the IPython "!" syntax to call a command-line tool
file_contents = !gsutil cat {SAMPLE_INFO_CSV}

# The "!" syntax returns a special IPython type (IPython.utils.text.SList)
# where each line is a separate item in the list.
# Let's look at the first two lines (the slice end index is exclusive):
file_contents[0:2]
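
The `!` syntax only works inside IPython. In a plain Python script, a similar list of lines can be obtained with the standard library's `subprocess` module. The following is a minimal sketch, not part of the original notebook: the helper name `run_and_split_lines` is hypothetical, and a small Python one-liner with placeholder values stands in for `gsutil` so that the example runs without Cloud Storage access.

```python
import subprocess
import sys


def run_and_split_lines(command):
    """Run a command and return its standard output as a list of lines."""
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,  # decode the output bytes to str
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()


# Stand-in command (an assumption for illustration): prints two CSV-like lines
# with placeholder values, so the sketch does not depend on gsutil.
demo_command = [
    sys.executable,
    "-c",
    "print('Population,Super_Population'); print('P1,S1')",
]
demo_lines = run_and_split_lines(demo_command)

# Same slice as in the cell above: the first two lines.
print(demo_lines[0:2])
```

With `gsutil` installed, the command list would presumably be `["gsutil", "cat", SAMPLE_INFO_CSV]`, mirroring the `!gsutil cat` call above.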

In [None]:
# To load this into a DataFrame, first join the lines back into a single string
sample_info = pd.read_csv(StringIO("\n".join(file_contents)), engine="python")
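
The join-then-wrap step above can be seen in isolation with a few hypothetical lines. This sketch uses the standard library's `csv.DictReader` as a dependency-free stand-in for `pd.read_csv`. The column names come from this notebook, but the values are placeholders, not real data.

```python
import csv
from io import StringIO

# Hypothetical lines, standing in for the list returned by the "!" syntax.
lines = ["Population,Super_Population", "P1,S1", "P2,S2"]

# Joining with newlines rebuilds the text of the original file.
csv_text = "\n".join(lines)

# StringIO wraps the string in an in-memory, file-like object,
# which is what CSV parsers expect to read from.
buffer = StringIO(csv_text)

# Parse the buffer: the first line becomes the header, the rest become rows.
rows = list(csv.DictReader(buffer))
print(len(rows))
print(rows[0]["Population"], rows[0]["Super_Population"])
```

`pd.read_csv` accepts the same kind of file-like object, which is why the cell above passes it a `StringIO` rather than the list of lines.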

In [None]:
# Let's see the structure of what we loaded
sample_info.info()

In [None]:
sample_info.head()

In [None]:
# Pandas can give us summary statistics for the numeric fields
sample_info.describe()

In [None]:
counts_by_super = sample_info["Super_Population"].value_counts()
plt.pie(counts_by_super.values, labels=counts_by_super.index, autopct="%1.1f%%")
plt.show()

In [None]:
counts_by_population = sample_info["Population"].value_counts()
plt.pie(
    counts_by_population.values, labels=counts_by_population.index, autopct="%1.1f%%"
)
plt.show()

## Provenance

In [None]:
import datetime

print(datetime.datetime.now())

In [None]:
!pip3 freeze

Copyright 2018 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.