# Load your ML dataset

This lecture covers loading data stored as CSV (comma-separated values) files. More on [CSV files](https://tools.ietf.org/html/rfc4180).

We will explore:

* Load CSV Files with the Python Standard Library
* Load CSV Files with NumPy
* Load CSV Files with Pandas

## Preliminary considerations

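Before loading a CSV file, review in particular:

* ***File Header***: whether the first line holds column names. If it does not, you may need to supply the names yourself.

* ***Comments***: whether the file contains comment lines, and which character marks them (often a leading `#`).

* ***Delimiter***: which character separates the fields. The comma is standard, but tabs and semicolons are also common.

* ***Quotes***: whether field values are quoted, and with which character. The double quote is the usual default.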
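
The four checks above can be sketched with the standard library alone. This is a minimal illustration, not a general-purpose loader: the sample text is made up for the example (it is not the dataset used in this lecture), and the choice of `;` as delimiter and `#` as comment marker is an assumption of the sample.

```python
import csv
import io

# Made-up sample (an assumption for illustration): a comment line, a header row,
# ';' as delimiter, and a quoted field that contains the delimiter.
sample = io.StringIO(
    '# exported for illustration\n'
    'name;score\n'
    '"Smith; John";3.5\n'
    'Lee;4.0\n'
)

# csv.reader has no notion of comments, so drop comment lines first.
lines = (line for line in sample if not line.startswith('#'))
reader = csv.reader(lines, delimiter=';', quotechar='"')

header = next(reader)   # first non-comment row holds the column names
rows = list(reader)     # remaining rows are data, still as strings

print(header)
print(rows)
```

Note that the quoted field keeps its embedded `;` because the quote character tells the reader where the field starts and ends.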

## Some test data 

We will use the well-known "Pima Indians dataset". The data used to be freely available from the UCI ML Repository and can now be found elsewhere. For your convenience, it can be downloaded with the course material for this lecture, e.g.:
   * https://drive.google.com/open?id=12pjLYLeuZ__4SVPuz6zL9QQpPiwbVYDKz_rsn4eeHzI


In [None]:
!ls -trlh pima-indians-diabetes.data.csv

In [None]:
!head -50 pima-indians-diabetes.data.csv

## Load CSV Files with the Python Standard Library

More info on `csv.reader()` can be found in the [CSV File Reading and Writing](https://docs.python.org/3/library/csv.html) section of the Python documentation.

In [None]:
import csv
import numpy as np

Note: file open options are documented [here](https://docs.python.org/3/library/functions.html#open).

In [None]:
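filename = 'pima-indians-diabetes.data.csv'
# 'rt' = read in text mode (both are defaults); newline='' is recommended by the csv docs
with open(filename, 'rt', newline='') as raw_data:   # the with block closes the file for us
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)                                 # each row is a list of strings
x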

In [None]:
data = np.array(x).astype('float')
data

In [None]:
print(data.shape)
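
Since `csv.reader()` returns every field as a string, it can help to check the rows before converting them to numbers. The following is a minimal sketch of such a check. To keep it self-contained it reads two illustrative rows in a nine-column layout from an in-memory string (a stand-in, not the file itself), and the names `sample`, `widths`, and `numeric` are chosen for this example only.

```python
import csv
import io

# Stand-in for the real file (an assumption): two rows with nine columns each.
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

rows = list(csv.reader(sample, delimiter=','))

# Every row should have the same number of fields; ragged rows often
# point to a wrong delimiter or an unexpected quote character.
widths = {len(row) for row in rows}
assert len(widths) == 1, f"inconsistent row lengths: {widths}"

# csv.reader always yields strings, so convert explicitly.
numeric = [[float(value) for value in row] for row in rows]
print(len(numeric), len(numeric[0]))
```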

## Load CSV Files with NumPy

In [None]:
from numpy import loadtxt

More information on the `numpy.loadtxt()` function can be found in the [NumPy API documentation for loadtxt](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html). The code below loads the file as a `numpy.ndarray` (more info in the [NumPy API documentation for ndarray](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html)).

In [None]:
filename = 'pima-indians-diabetes.data.csv'
data = loadtxt(filename, delimiter=",")   # loadtxt accepts a file name and handles opening and closing
print(data.shape)
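
The file used here has no header and no comments, but `numpy.loadtxt()` can cope with both through its `skiprows` and `comments` parameters. A minimal sketch follows, assuming a made-up in-memory sample with one header line and one `#` comment line; the sample content is illustrative only.

```python
import io
import numpy as np

# Made-up sample (an assumption): header on the first line, then a comment line.
sample = io.StringIO(
    "a,b,c\n"
    "# a comment line\n"
    "1,2,3.5\n"
    "4,5,6.5\n"
)

# skiprows=1 skips the header; lines starting with '#' are ignored as comments.
data = np.loadtxt(sample, delimiter=",", skiprows=1, comments="#")
print(data.shape)
```

Keep in mind that `loadtxt()` expects every remaining field to be numeric; files with text columns are usually easier to handle with Pandas.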

## Load CSV Files with Pandas

Use the `pandas.read_csv()` function (more info [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)). The function returns a `pandas.DataFrame` (more information in the [Pandas API documentation for DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)) that one can immediately process, summarize, plot, etc.

In [None]:
from pandas import read_csv

In [None]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

In [None]:
type(data)

In [None]:
print(data.shape)

In [None]:
data
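
One practical difference from the two previous methods is that a `DataFrame` keeps a type per column and carries the column names. The sketch below illustrates this under stated assumptions: it reads two illustrative rows from an in-memory string instead of the file, reuses the column names from above, and uses `DataFrame.to_numpy()` to get back a plain NumPy array when one is needed.

```python
import io
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Stand-in for the real file (an assumption): two rows with nine columns each.
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# Passing names= tells read_csv that the file has no header row.
df = pd.read_csv(sample, names=names)
print(df.dtypes)        # one type per column, inferred from the values

# Convert to a NumPy array, e.g. for libraries that expect one.
array = df.to_numpy()
print(array.shape)
```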

## Summary

What we did:

* we discussed the need to import data
* we discussed the CSV format 
* we discussed peculiarities to check in the file before importing
* we familiarized ourselves with three ways to load data into Python (for ML purposes). We discussed why the third method, Pandas, might be a good choice.

## What's next

It is time to start looking at the data we loaded. We will discover how to use simple descriptive statistics to better understand our data.