# Load CSV data

This tutorial provides an example of how to load CSV data from a file into a `tf.data.Dataset`.

The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood that a passenger survived, based on characteristics such as age, gender, ticket class, and whether the person was traveling alone.

In [1]:
import functools

import numpy as np
import tensorflow as tf

In [2]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


In [4]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

## Load data

To start, let's look at the top of the CSV file to see how it is formatted.

In [5]:
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n
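
The `!head` shell escape above relies on a Unix-like environment. As a portable alternative, a minimal sketch using only the Python standard library might look like the following. The helper name `head` and the small temporary sample file are illustrative assumptions, not part of the tutorial.

```python
import itertools
import os
import tempfile


def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Demo on a tiny temporary CSV so the sketch runs on its own.
# In the notebook, you would call head(train_file_path) instead.
sample = (
    "survived,sex,age\n"
    "0,male,22.0\n"
    "1,female,38.0\n"
)
with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

preview = head(tmp.name, n=2)  # header line plus the first record
for line in preview:
    print(line)

os.remove(tmp.name)  # clean up the temporary file
```

Because `itertools.islice` stops after `n` lines, only the start of the file is read, which keeps the preview cheap even for large CSV files.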


You can [load this using pandas](pandas.ipynb) and pass the NumPy arrays to TensorFlow. If you need to scale up to a large set of files, or need a loader that integrates with [TensorFlow and tf.data](../../guide/data.ipynb), use the `tf.data.experimental.make_csv_dataset` function.

The only column you need to identify explicitly is the label column, which holds the value that the model is intended to predict.

In [6]:
LABEL_COLUMN = 'survived'
LABELS = [0, 1]
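
To make the role of the label column concrete, here is a plain-Python sketch that separates the label from the remaining feature columns. It uses only the standard library and two rows copied from the file preview above. The helper `split_label` is hypothetical and is not how `make_csv_dataset` is implemented; it only illustrates the idea of pairing each row's features with its label.

```python
import csv
import io

LABEL_COLUMN = "survived"

# Header and first two records, copied from the preview of train.csv.
SAMPLE = """survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
"""


def split_label(row, label_column):
    """Return (features, label) for one CSV row given as a dict of strings."""
    features = {k: v for k, v in row.items() if k != label_column}
    return features, int(row[label_column])


examples = [
    split_label(row, LABEL_COLUMN)
    for row in csv.DictReader(io.StringIO(SAMPLE))
]
labels = [label for _, label in examples]

print(labels)                  # the label of each row
print(sorted(examples[0][0]))  # the feature column names of the first row
```

Every column other than `survived` ends up in the features dictionary, so no other column has to be named explicitly.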

Now read the CSV data from the file and create a dataset.

(For the full documentation, see `tf.data.experimental.make_csv_dataset`.)


In [None]:
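def get_dataset(filepath, **kwargs):
    # Small batches for readability; '?' marks missing values; one pass over the file.
    dataset = tf.data.experimental.make_csv_dataset(
        filepath, batch_size=5, label_name=LABEL_COLUMN, na_value='?',
        num_epochs=1, ignore_errors=True,  # ignore_errors skips unparseable records
        **kwargs)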
    