# Load and Understand Your Machine Learning Data With Descriptive Statistics
Machine Learning Mastery With Python: Understand Your Data, Create Accurate Models and Work Projects End-To-End
by
Jason Brownlee

Migrated to Jupyter with additions by Mitch Sanders 5/15/2017

## How To Load Machine Learning Data
You must be able to load your data before you can start your machine learning project. The
most common format for machine learning data is CSV files. There are a number of ways to
load a CSV file in Python. In this lesson you will learn three ways that you can use to load
your CSV data in Python:
1. Load CSV Files with the Python Standard Library.
2. Load CSV Files with NumPy.
3. Load CSV Files with Pandas.

Let’s get started.

### Considerations When Loading CSV Data
There are a number of considerations when loading your machine learning data from CSV files.
For reference, you can learn a lot about the expectations for CSV files by reviewing the CSV
request for comment (RFC 4180) titled Common Format and MIME Type for Comma-Separated Values
(CSV) Files.
#### File Header
Does your data have a file header? If so this can help in automatically assigning names to each
column of data. If not, you may need to name your attributes manually. Either way, you should
explicitly specify whether or not your CSV file had a file header when loading your data.
#### Comments
Does your data have comments? Comments in a CSV file are commonly indicated by a hash (#) at the
start of a line. If you have comments in your file, depending on the method used to load your
data, you may need to indicate whether or not to expect comments and the character to expect
to signify a comment line.
#### Delimiter
The standard delimiter that separates values in fields is the comma (,) character. Your file could
use a different delimiter like tab or white space in which case you must specify it explicitly.
#### Quotes
Sometimes field values can have spaces. In these CSV files the values are often quoted. The
default quote character is the double quotation marks character. Other characters can be used,
and you must specify the quote character used in your file.
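
As a rough sketch of how these four considerations map onto loader arguments, the example below parses a small made-up CSV snippet with pandas.read_csv(). The sample text and its column names are invented for illustration; header, comment, sep and quotechar are existing read_csv() parameters.

```python
from io import StringIO
from pandas import read_csv

# Hypothetical sample: a comment line, a header row, ';' as delimiter, quoted fields
sample = StringIO(
    "# exported for illustration only\n"
    "name;age;class\n"
    "'Ann Lee';50;1\n"
    "'Bob Ray';31;0\n"
)
data = read_csv(
    sample,
    header=0,       # first non-comment line holds the column names
    comment='#',    # ignore anything after a hash
    sep=';',        # non-default delimiter
    quotechar="'",  # non-default quote character
)
print(data.shape)
print(list(data.columns))
```

For a file without a header row, pass header=None (or supply names=...) so that the first data row is not consumed as column names.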


### Pima Indians Dataset
The Pima Indians dataset is used to demonstrate data loading in this lesson. It will also be used
in many of the lessons to come. This dataset describes the medical records for Pima Indians
and whether or not each patient will have an onset of diabetes within five years. As such it
is a classification problem. It is a good dataset for demonstration because all of the input
attributes are numeric and the output variable to be predicted is binary (0 or 1). The data is
freely available from the UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

### Load CSV Files with the Python Standard Library
The Python standard library provides the csv module and its reader() function, which can be
used to load CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it
for machine learning. For example, you can download the Pima Indians dataset into your local
directory with the filename pima-indians-diabetes.data.csv. All fields in this dataset are
numeric and there is no header line.

In [None]:
# Load CSV Using Python Standard Library
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
# Open in text mode: in Python 3, csv.reader expects strings rather than bytes
with open(filename, 'r', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
data = numpy.array(x).astype('float')
print(data.shape)

In [None]:
type(raw_data)

### Load CSV File with NumPy to an ndarray Array Type
You can load your CSV data using NumPy and the numpy.loadtxt() function. This function
assumes no header row and all data has the same format. The example below assumes that the
file pima-indians-diabetes.data.csv is in your current working directory.


In [None]:
# Load CSV using NumPy
from numpy import loadtxt
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rb')
data = loadtxt(raw_data, delimiter=",")
print(data.shape)

In [None]:
type(data)

In [None]:
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen  # Python 3 location; Python 2 used "from urllib import urlopen"
url = 'https://goo.gl/vhm1eU'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)
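
Because numpy.loadtxt() assumes there is no header row, a file that does have one needs the header skipped explicitly. The sketch below shows one way to do that with the skiprows parameter; the inline sample and its column names are invented for illustration so that the snippet runs without a file on disk.

```python
from io import StringIO
from numpy import loadtxt

# Invented sample with a header row, for illustration only
sample = StringIO(
    "preg,plas,class\n"
    "6,148,1\n"
    "1,85,0\n"
)
# skiprows=1 drops the header line so every remaining field parses as a number
data = loadtxt(sample, delimiter=",", skiprows=1)
print(data.shape)
```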

### Load CSV File with Pandas
You can load your CSV data using Pandas and the pandas.read_csv() function. This function
is very flexible and is the approach I recommend for loading your machine learning
data. The function returns a pandas.DataFrame that you can immediately start summarizing
and plotting. The example below assumes that the pima-indians-diabetes.data.csv file is
in the current working directory.

In [None]:
# Load CSV using Pandas
from pandas import read_csv
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.shape)

In [None]:
type(data)

In [None]:
# Load CSV using Pandas from URL
from pandas import read_csv
url = 'https://goo.gl/vhm1eU'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
print(data.shape)

In [None]:
# Class Distribution
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
class_counts = data.groupby('class').size()
print(class_counts)

Generally, I recommend that you load your data with Pandas in practice, and all subsequent examples will use this method.


Documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

https://docs.python.org/2/library/csv.html

http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html

http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html


## Understand Your Data With Descriptive Statistics
You must understand your data in order to get the best results. In this lesson you will discover
7 recipes that you can use in Python to better understand your machine learning data. After
reading this lesson you will know how to:
1. Take a peek at your raw data.
2. Review the dimensions of your dataset.
3. Review the data types of attributes in your data.
4. Summarize the distribution of instances across classes in your dataset.
5. Summarize your data using descriptive statistics.
6. Understand the relationships in your data using correlations.
7. Review the skew of the distributions of each attribute.

Each recipe is demonstrated by loading the Pima Indians Diabetes classification dataset
from the UCI Machine Learning repository. Open your Python interactive environment and try
each recipe out in turn. 

Let’s get started.


### Peek at Your Data
There is no substitute for looking at the raw data. Looking at the raw data can reveal insights
that you cannot get any other way. It can also plant seeds that may later grow into ideas on
how to better pre-process and handle the data for machine learning tasks. You can review the
first 20 rows of your data using the head() function on the Pandas DataFrame.


In [None]:
# View first 20 rows
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
peek = data.head(20)
print(peek)

In [None]:
?numpy

### Dimensions of Your Data
You must have a very good handle on how much data you have, both in terms of rows and
columns.
- Too many rows and algorithms may take too long to train. Too few and perhaps you do
not have enough data to train the algorithms.
- Too many features and some algorithms can be distracted or suffer poor performance due
to the curse of dimensionality.

You can review the shape and size of your dataset by printing the shape property on the
Pandas DataFrame.
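
A minimal sketch of that check is shown here. It uses a few inline rows in the same nine-column layout so that it runs without the file on disk; the inline rows are illustrative, and in practice you would pass the filename to read_csv() as in the other recipes.

```python
from io import StringIO
from pandas import read_csv

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Illustrative rows in the Pima layout; replace with the real filename in practice
sample = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
data = read_csv(sample, names=names)
rows, cols = data.shape  # shape is a (rows, columns) tuple
print(data.shape)
```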

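### Data Type For Each Attribute
The type of each attribute matters, because strings may need to be converted to numeric
values before modeling. You can list the data type that Pandas inferred for each attribute
using the dtypes property on the DataFrame.

In [None]: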
# Data Types for Each Attribute
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
types = data.dtypes
print(types)

### Descriptive Statistics
Descriptive statistics can give you great insight into the shape of each attribute. Often you can
create more summaries than you have time to review. The describe() function on the Pandas
DataFrame lists 8 statistical properties of each attribute. They are:
- Count.
- Mean.
- Standard Deviation.
- Minimum Value.
- 25th Percentile.
- 50th Percentile (Median).
- 75th Percentile.
- Maximum Value.

In [None]:
# Statistical Summary
from pandas import read_csv
from pandas import set_option
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('display.precision', 3)
description = data.describe()
print(description)


You can see that you do get a lot of data. You will note some calls to pandas.set_option()
in the recipe to change the precision of the numbers and the preferred width of the output. This
is to make it more readable for this example. When describing your data this way, it is worth
taking some time and reviewing observations from the results. This might include the presence
of NA values for missing data or surprising distributions for attributes.
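
One way to follow up on that last point is to count missing entries per column. The sketch below uses a tiny invented frame, not the Pima data. Note that the count row of describe() reports non-missing values only, so a count lower than the number of rows is a hint that NA values are present.

```python
import numpy as np
from pandas import DataFrame

# Invented values for illustration only
data = DataFrame({
    'plas': [148.0, 85.0, np.nan, 89.0],
    'mass': [33.6, np.nan, np.nan, 28.1],
})
missing = data.isnull().sum()          # number of NA values per column
counts = data.describe().loc['count']  # number of non-missing values per column
print(missing)
print(counts)
```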

### Class Distribution (Classification Only)
On classification problems you need to know how balanced the class values are. Highly imbalanced
problems (a lot more observations for one class than another) are common and may need special
handling in the data preparation stage of your project. You can quickly get an idea of the
distribution of the class attribute in Pandas.


In [None]:
# Class Distribution
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
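
Raw counts are easier to judge as proportions. As a small sketch on invented labels (not the Pima data), value_counts(normalize=True) reports the fraction of rows in each class:

```python
from pandas import Series

# Invented class labels for illustration: 0 = negative, 1 = positive
labels = Series([0, 0, 0, 1, 0, 1, 0, 0], name='class')
class_counts = labels.value_counts()               # absolute counts per class
class_ratio = labels.value_counts(normalize=True)  # fraction of rows per class
print(class_counts)
print(class_ratio)
```

With a DataFrame loaded as in the recipe above, the same call would be data['class'].value_counts(normalize=True).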

### Correlations Between Attributes
Correlation refers to the relationship between two variables and how they may or may not
change together. The most common method for calculating correlation is Pearson’s Correlation
Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1
or 1 shows a perfect negative or positive correlation respectively, whereas a value of 0 shows no
correlation at all. Some machine learning algorithms like linear and logistic regression can suffer
poor performance if there are highly correlated attributes in your dataset. As such, it is a good
idea to review all of the pairwise correlations of the attributes in your dataset. You can use the
corr() function on the Pandas DataFrame to calculate a correlation matrix.


In [None]:
# Pairwise Pearson correlations
from pandas import read_csv
from pandas import set_option
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('display.precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
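
To make the coefficient less of a black box, the sketch below computes Pearson's r by hand for two invented columns and compares it with the value that corr() returns. The data and the names x and y are made up for illustration.

```python
import numpy as np
from pandas import DataFrame

# Invented values for illustration only
data = DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                  'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

# Pearson's r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
dx = data['x'] - data['x'].mean()
dy = data['y'] - data['y'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = data.corr(method='pearson').loc['x', 'y']
print(r_manual, r_pandas)
```

Both values should agree to floating point precision, at roughly 0.77 for this sample.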

### Skew of Univariate Distributions
Skew describes how far a distribution that is assumed Gaussian (normal or bell curve) is shifted
or squashed in one direction or another. Many machine learning algorithms assume a Gaussian
distribution. Knowing that an attribute has a skew may allow you to perform data preparation
to correct the skew and later improve the accuracy of your models. You can calculate the skew
of each attribute using the skew() function on the Pandas DataFrame. A positive result indicates
a right skew, a negative result indicates a left skew, and values closer to zero show less skew.


In [None]:
# Skew for each attribute
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
skew = data.skew()
print(skew)
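
As one example of the kind of data preparation mentioned above, a log transform often reduces a right skew. The sketch below uses invented, strictly positive values chosen so that the effect is easy to see. It illustrates the idea only and is not a recommendation for any particular Pima attribute.

```python
import numpy as np
from pandas import Series

# Invented right-skewed values for illustration only (each value doubles the last)
values = Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])
skew_before = values.skew()
# The natural log turns the doubling sequence into evenly spaced values
skew_after = np.log(values).skew()
print(skew_before, skew_after)
```

The log is only defined for positive values, so attributes that contain zeros would need a shifted variant such as numpy.log1p().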


### Tips To Remember
This section gives you some tips to remember when reviewing your data using summary statistics.
- Review the numbers. Generating the summary statistics is not enough. Take a moment
to pause, read and really think about the numbers you are seeing.
- Ask why. Review your numbers and ask a lot of questions. How and why are you seeing
specific numbers? Think about how the numbers relate to the problem domain in general
and specific entities that observations relate to.
- Write down ideas. Write down your observations and ideas. Keep a small text file or
note pad and jot down all of the ideas for how variables may relate, for what numbers
mean, and ideas for techniques to try later. The things you write down now while the
data is fresh will be very valuable later when you are trying to think up new things to try.

### Summary
In this section you discovered how to load your machine learning data in Python. You learned
three specific techniques that you can use:
- Load CSV Files with the Python Standard Library.
- Load CSV Files with NumPy.
- Load CSV Files with Pandas.

You also discovered the importance of describing your dataset before you start work
on your machine learning project. You discovered 7 different ways to summarize your dataset
using Python and Pandas:
- Peek At Your Data.
- Dimensions of Your Data.
- Data Types.
- Class Distribution.
- Data Summary.
- Correlations.
- Skewness.

### Next
Another excellent way that you can use to better understand your data is by generating plots
and charts. In the next lesson you will discover how you can visualize your data for machine
learning in Python.

### About the Pima Indian Dataset

#### Attribute Information:

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 