CSV files are the most common format for machine learning data, and there are several ways
to load one in Python. In this lesson you will learn three ways to load your CSV data in
Python:

* Load CSV Files with the Python Standard Library.
* Load CSV Files with NumPy.
* Load CSV Files with Pandas.

#### Considerations When Loading CSV Data

There are a number of considerations when loading your machine learning data from CSV files.
For reference, you can learn a lot about the expectations for CSV files by reviewing the
Request for Comments (RFC 4180) titled Common Format and MIME Type for Comma-Separated
Values (CSV) Files.

#### File Header

Does your data have a file header? If so, this can help in automatically assigning names to each
column of data. If not, you may need to name your attributes manually. Either way, you should
explicitly specify whether or not your CSV file has a file header when loading your data.
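
As a minimal sketch of stating the header explicitly, the example below uses the `header` and `names` arguments of `pandas.read_csv()`. The tiny in-memory datasets and the three column names are assumptions made up for illustration; they stand in for real files on disk.

```python
from io import StringIO
from pandas import read_csv

# Hypothetical in-memory data standing in for CSV files on disk.
with_header = StringIO("preg,plas,class\n6,148,1\n1,85,0\n")
without_header = StringIO("6,148,1\n1,85,0\n")

# header=0: the first line of the file holds the column names.
df_a = read_csv(with_header, header=0)

# header=None: there is no header line, so the names are supplied manually.
df_b = read_csv(without_header, header=None, names=['preg', 'plas', 'class'])

print(list(df_a.columns))
print(df_a.shape, df_b.shape)
```

Both calls should produce a frame with two rows and the same three column names; only the source of the names differs.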

#### Comments

Does your data have comments? Comments in a CSV file are typically indicated by a hash (#)
at the start of a line. If you have comments in your file, depending on the method used to
load your data, you may need to indicate whether or not to expect comments and which
character signifies a comment line.
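
A short sketch of comment handling, assuming `numpy.loadtxt()` is the loader: its `comments` argument names the character that starts a comment. The in-memory text below is invented for illustration.

```python
from io import StringIO
import numpy

# Hypothetical data with comment lines at the top and between rows.
text = (
    "# pregnancies, glucose, class\n"
    "6,148,1\n"
    "# a comment between rows\n"
    "1,85,0\n"
)

# comments='#' tells loadtxt to skip everything from '#' to the end of the line.
data = numpy.loadtxt(StringIO(text), delimiter=',', comments='#')
print(data.shape)
```

Other loaders differ: `pandas.read_csv()` takes a `comment` argument, while the standard library `csv` module has no comment option, so comment lines would have to be filtered out before parsing.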

#### Delimiter

The standard delimiter that separates field values is the comma (,) character. Your file could
use a different delimiter, such as a tab or whitespace, in which case you must specify it explicitly.
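
For example, a tab-separated file can be read with the standard library by passing `delimiter='\t'` to `csv.reader()`. This is a sketch on invented in-memory data rather than a real file.

```python
import csv
from io import StringIO

# Hypothetical tab-separated data standing in for a file on disk.
tab_text = "6\t148\t72\n1\t85\t66\n"

# Specify the delimiter explicitly; the default is a comma.
rows = list(csv.reader(StringIO(tab_text), delimiter='\t'))
print(rows)
```

Each row comes back as a list of strings, so numeric conversion is still a separate step.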

#### Quotes

Sometimes field values contain spaces or even the delimiter itself. In these CSV files the
values are often quoted. The default quote character is the double quotation mark ("). Other
characters can be used, in which case you must specify the quote character used in your file.
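
The sketch below shows the effect of the quote character with `csv.reader()` from the standard library. The two sample lines are invented for illustration: one uses the default double quote and the other uses a single quote, which has to be named through `quotechar`.

```python
import csv
from io import StringIO

# Hypothetical rows whose first field contains a comma and a space.
double_quoted = '"Smith, Jane",42\n'
single_quoted = "'Doe, John',37\n"

# The default quotechar is the double quotation mark.
rows_d = list(csv.reader(StringIO(double_quoted)))

# A different quote character must be specified explicitly.
rows_s = list(csv.reader(StringIO(single_quoted), quotechar="'"))

print(rows_d)
print(rows_s)
```

In both cases the quoted field is kept whole instead of being split at the embedded comma.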

#### Load CSV Files with the Python Standard Library

The Python standard library provides the csv module and its reader() function, which can be
used to load CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it
for machine learning.

In [20]:
# Load CSV Using Python Standard Library
import csv
import numpy
filename = './chapter_04/pima-indians-diabetes.data.csv'
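# Open the file in a context manager so it is closed once it has been read
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
# Convert the rows of strings to a NumPy array of floats
data = numpy.array(x).astype('float')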
print(data.shape)

(768, 9)


In [21]:
data

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

#### Load CSV Files with NumPy

You can load your CSV data using NumPy and the numpy.loadtxt() function. By default, this
function assumes that there is no header row and that all data has the same format.

In [17]:
# Load CSV using NumPy
from numpy import loadtxt
filename = './chapter_04/pima-indians-diabetes.data.csv'
# Open the file in a context manager so it is closed once it has been read
with open(filename, 'rt') as raw_data:
    data = loadtxt(raw_data, delimiter=",")
print(data.shape)

(768, 9)


In [19]:
data

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

#### Load CSV Files with Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function. This function
is very flexible and is the approach I recommend for loading your machine learning
data. The function returns a pandas.DataFrame that you can immediately start summarizing
and plotting.

In [15]:
# Load CSV using Pandas
from pandas import read_csv
filename = './chapter_04/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.shape)

(768, 9)


In [16]:
data.head()

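|   | preg | plas | pres | skin | test | mass | pedi  | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1 | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2 | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3 | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4 | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |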
