## How To Load Machine Learning Data


The most common format for machine learning data is CSV files. There are three ways to load
your CSV data in Python:
1. Load CSV Files with the Python Standard Library.
2. Load CSV Files with NumPy.
3. Load CSV Files with Pandas.

### Pima Indians Dataset

The Pima Indians dataset is used to demonstrate data loading. This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years. As such, it is a classification problem. It is a good dataset for demonstration because all of the input attributes are numeric and the output variable to be predicted is binary (0 or 1). The data is freely available from the UCI Machine Learning Repository.
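To make the layout concrete, here is a minimal sketch of how the numeric inputs and the binary output can be separated once the data is in a NumPy array. It uses only two rows, copied from the Pandas output shown later in this section, and assumes the class label is the last column. The names `rows`, `X` and `y` are illustrative choices, not part of the dataset.

```python
import numpy as np

# Two sample rows: eight input attributes followed by the 0/1 class label.
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
]
data = np.array(rows, dtype=float)

# Assumption: the last column is the output variable.
X = data[:, :-1]             # input attributes (all columns but the last)
y = data[:, -1].astype(int)  # class labels as integers

print(X.shape)
print(y)
```

The same slicing applies unchanged to the full array once it has been loaded with any of the three approaches below.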

### Load CSV Files with the Python Standard Library

The Python standard library provides the csv module and its reader() function, which can be
used to load CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it
for machine learning. For example, you can download the Pima Indians dataset into your local
directory with the filename pima-indians-diabetes.data.csv. All fields in this
dataset are numeric and there is no header line.

In [7]:
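import csv
import numpy as np

filename = "pima-indians-diabetes.data.csv"  # any numeric CSV file without a header row
# Open the file in a with block so it is closed after reading
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)  # each row is a list of strings
# Convert the rows of strings to a 2D array of floats
data = np.array(x).astype(float)
print(data.shape)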

(768, 9)


The example creates a reader object that iterates over each row of the data. The rows are
collected into a list and converted into a NumPy array, and the shape of the array is printed.
For more information on the csv.reader() function, see
[CSV File Reading and Writing](https://docs.python.org/3/library/csv.html) in the Python API documentation.
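The Pima Indians file has no header line, but many CSV files do. As a hedged sketch, one way to handle that with csv.reader() is to consume the first row with next() before collecting the rest. The CSV text below is a small hypothetical stand-in held in memory with io.StringIO, so the example runs without any file on disk.

```python
import csv
import io

import numpy as np

# Hypothetical CSV text with a header line, standing in for a file on disk.
text = "preg,plas,class\n6,148,1\n1,85,0\n"

with io.StringIO(text) as raw_data:
    reader = csv.reader(raw_data, delimiter=',')
    header = next(reader)  # consume the header row
    rows = list(reader)    # remaining rows are lists of strings

# Convert the string fields to floats, as in the example above.
data = np.array(rows).astype(float)
print(header)
print(data.shape)
```

With a real file, replacing io.StringIO(text) with open(filename, 'rt', newline='') should give the same behavior.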

### Load CSV Files with NumPy


You can load your CSV data using NumPy and the numpy.loadtxt() function. By default, this
function assumes there is no header row and that all values have the same format. The example
below assumes that the file pima-indians-diabetes.data.csv is in your current working directory.

In [10]:
# Load CSV using NumPy
from numpy import loadtxt
filename = 'pima-indians-diabetes.data.csv'
# loadtxt accepts a filename directly and closes the file when done
data = loadtxt(filename, delimiter=",")
print(data.shape)

(768, 9)


For more information on the [numpy.loadtxt()](https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html) function, see the API documentation. The example can also be modified to load the same dataset directly from a URL, as follows.

In [13]:
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen
url = 'https://goo.gl/bDdBiA'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)

(768, 9)
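Because numpy.loadtxt() assumes no header row by default, a file that does have one needs the skiprows parameter. The sketch below illustrates this on a small hypothetical CSV string wrapped in io.StringIO. The column names and values are only placeholders for a real file.

```python
import io

from numpy import loadtxt

# Hypothetical CSV text with one header line, standing in for a real file.
text = "preg,plas,class\n6,148,1\n1,85,0\n"

# skiprows=1 tells loadtxt to ignore the header line before parsing numbers.
data = loadtxt(io.StringIO(text), delimiter=",", skiprows=1)
print(data.shape)
```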


### Load CSV Files with Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function. This function
is very flexible and is the approach I recommend for loading your machine learning
data. The function returns a pandas.DataFrame that you can immediately start summarizing
and plotting. The example below assumes that the pima-indians-diabetes.data.csv file is
in the current working directory.

In [15]:
# Load CSV using Pandas
from pandas import read_csv
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data)
print(data.shape)

     preg  plas  pres  skin  test  mass   pedi  age  class
0       6   148    72    35     0  33.6  0.627   50      1
1       1    85    66    29     0  26.6  0.351   31      0
2       8   183    64     0     0  23.3  0.672   32      1
3       1    89    66    23    94  28.1  0.167   21      0
4       0   137    40    35   168  43.1  2.288   33      1
..    ...   ...   ...   ...   ...   ...    ...  ...    ...
763    10   101    76    48   180  32.9  0.171   63      0
764     2   122    70    27     0  36.8  0.340   27      0
765     5   121    72    23   112  26.2  0.245   30      0
766     1   126    60     0     0  30.1  0.349   47      1
767     1    93    70    31     0  30.4  0.315   23      0

[768 rows x 9 columns]
(768, 9)

The same example can be modified to load the dataset directly from a URL:


In [2]:
# Load CSV using Pandas from URL
from pandas import read_csv
url = 'https://goo.gl/bDdBiA'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
print(data.shape)

(768, 9)


To learn more about the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function, refer to the API documentation.
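As a brief, hedged illustration of what "summarizing" a DataFrame can look like, the sketch below loads two rows (copied from the output above) from an in-memory string, inspects the column types and class counts, and converts the result to a NumPy array. It assumes pandas 0.24 or later for to_numpy(). The `text` variable is a stand-in for the real file.

```python
import io

from pandas import read_csv

# Two rows from the dataset, held in memory instead of a file on disk.
text = "6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Passing names tells read_csv that the data itself has no header row.
data = read_csv(io.StringIO(text), names=names)

print(data.dtypes)                   # type inferred for each column
print(data['class'].value_counts())  # number of rows per class label

# Convert to a plain NumPy array for libraries that expect one.
array = data.to_numpy()
print(array.shape)
```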