# Data preprocessing tutorial

A crash course on data preprocessing using Pandas and Scikit-Learn.

## More resources

* https://en.wikipedia.org/wiki/Pandas_(software)
* https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii
* https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html

## Why Pandas?

A quick demo... (Make sure you've run `fetch_uci.py` in `ADSA/tutorial/datasets` first.)

## UCI Breast Cancer Dataset (breast.data)

* An easy dataset to start with
* All variables are continuous, except for one ID column and one label (M, B) column
* The goal is to classify whether a tumor is malignant (M) or benign (B)

In [34]:
import numpy as np

# Load in data using numpy
# (the default float dtype fails here with a ValueError, because the label column holds strings)
prefix = "../datasets/"
data = np.loadtxt(prefix + "breast.data", delimiter=",")

In [33]:
# dtype='O' (object) avoids the float conversion, but every value is loaded as a string
data = np.loadtxt(prefix + "breast.data", delimiter=",", dtype='O')
data

array([['842302', 'M', '17.99', ..., '0.2654', '0.4601', '0.1189'],
       ['842517', 'M', '20.57', ..., '0.186', '0.275', '0.08902'],
       ['84300903', 'M', '19.69', ..., '0.243', '0.3613', '0.08758'],
       ..., 
       ['926954', 'M', '16.6', ..., '0.1418', '0.2218', '0.0782'],
       ['927241', 'M', '20.6', ..., '0.265', '0.4087', '0.124'],
       ['92751', 'B', '7.76', ..., '0', '0.2871', '0.07039']], dtype=object)

## What's happening here?

Good things about Numpy:
* Vectorization - compiled code runs faster than interpreted code
* Syntax is intuitive (for the most part)

Limitations of Numpy:
* Can't handle multiple datatypes in an array
    * Existing workarounds in Numpy often forgo the speed boosts the library gives
* Limited support for anything regarding "structured data" (data in SQL tables)
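
The single-datatype limitation can be seen on a tiny example. This is a minimal sketch: the two rows below are a hypothetical mini-sample in the same layout as `breast.data` (id, label, one measurement), not data loaded from the file.

```python
import numpy as np

# Hypothetical mini-sample: id, label, one measurement
rows = [[842302, "M", 17.99], [92751, "B", 7.76]]

# Without an explicit dtype, every value is coerced to a single string type
as_strings = np.array(rows)
print(as_strings.dtype.kind)  # 'U' stands for unicode string

# dtype=object keeps the original Python objects, but numeric
# operations on such an array no longer use the fast compiled path
as_objects = np.array(rows, dtype=object)
print(as_objects.dtype)

# Numeric work needs an explicit slice and conversion first
measurements = as_objects[:, 2].astype(float)
print(measurements.mean())
```

Each numeric column has to be sliced out and converted by hand, which is the bookkeeping a DataFrame does for you.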
    

## Why Pandas? (Part II)

* Dataframes allow for multiple datatypes, like SQL tables
    * Idea was taken from R
* Rows (the "index") can be indexed by keys (e.g. strings), and so can columns
    * Similar to SQL table (column manipulation, at least)
    * Not very conventional for numpy rows and columns to be indexed by keys (but can be done)
* Supports vectorization (We'll talk about this later)
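
As a rough illustration of the points above, here is a small DataFrame with mixed datatypes and key-based indexing. The row keys (`p1`, `p2`, `p3`) and column names are made up for this sketch; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical three-row sample with a string column and a float column
sample = pd.DataFrame(
    {"label": ["M", "M", "B"], "radius": [17.99, 20.57, 7.76]},
    index=["p1", "p2", "p3"],
)

# Each column keeps its own datatype
print(sample.dtypes)

# Rows and columns are both addressed by key
print(sample.loc["p3", "radius"])

# Arithmetic on a column is vectorized, just like a Numpy array
doubled = sample["radius"] * 2
print(doubled)
```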

In [10]:
import pandas as pd

df = pd.read_csv(prefix + "breast.data", sep=",")

In [11]:
df.head()

Unnamed: 0,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
0,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
1,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
2,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
3,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
4,843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244


In [12]:
# read_csv assumed the first line of the file was a header; pass "header=None" to disable this
df = pd.read_csv(prefix + "breast.data", sep=",", header=None)

In [13]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
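
The effect of `header=None` is easier to see on a file small enough to read at a glance. This sketch uses an in-memory string (via `io.StringIO`) as a stand-in for `breast.data`; the two rows are copied from the output above and truncated to three columns.

```python
import io

import pandas as pd

# Two data rows, no header line (stand-in for breast.data)
csv_text = "842302,M,17.99\n842517,M,20.57\n"

# Default: the first line is consumed as column names, so one sample is lost
with_header = pd.read_csv(io.StringIO(csv_text))
print(with_header.shape)
print(list(with_header.columns))

# header=None: every line is data and columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(without_header.shape)
print(list(without_header.columns))
```

`read_csv` also accepts a `names=` list alongside `header=None` if you would rather name the columns while loading.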


In [21]:
# I'd like to rename the columns, for readability
# R's syntax for renaming columns is weird: 
#     http://www.cookbook-r.com/Manipulating_data/Renaming_columns_in_a_data_frame/
df.rename(columns={0: "id", 1: "Tumor Status"}, inplace=True)

In [17]:
df.head()

Unnamed: 0,id,Tumor Status,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [19]:
# Drop the ids of the samples
# "axis = 1" means that "id" exists as a column, not a row
# See this post for more information about the definition of "axis"
#     http://stackoverflow.com/q/25773245/2014591
tumors = df.drop("id", axis=1)

In [20]:
tumors.head()

Unnamed: 0,Tumor Status,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
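
To make the meaning of `axis` concrete, here is a small sketch on a hypothetical two-row frame (the `radius` column name is made up for illustration). Note that `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical two-row sample
sample = pd.DataFrame(
    {"id": [842302, 842517], "Tumor Status": ["M", "M"], "radius": [17.99, 20.57]}
)

# axis=1: look for the label "id" among the columns
no_id_column = sample.drop("id", axis=1)

# The columns= keyword says the same thing more explicitly
no_id_keyword = sample.drop(columns="id")

# axis=0: look for the label 0 among the rows (the index)
no_first_row = sample.drop(0, axis=0)

print(list(no_id_column.columns))
print(list(no_first_row.index))
print(list(sample.columns))  # the original still has "id"
```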


In [22]:
# Let's split the tumors into malignant (M) and benign (B)
malignant = tumors[tumors["Tumor Status"] == "M"]
benign = tumors[tumors["Tumor Status"] == "B"]

In [23]:
malignant.head()

Unnamed: 0,Tumor Status,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
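
The filtering above works in two steps: the comparison builds a boolean Series, and indexing with that Series keeps only the rows marked `True`. A minimal sketch on a hypothetical three-row frame (the `radius` column name is made up):

```python
import pandas as pd

# Hypothetical three-row sample
sample = pd.DataFrame(
    {"Tumor Status": ["M", "B", "M"], "radius": [17.99, 7.76, 20.57]}
)

# Step 1: the comparison is vectorized and yields one boolean per row
mask = sample["Tumor Status"] == "M"
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows where it is True
malignant_sample = sample[mask]
print(malignant_sample)

# Quick summaries per class
print(sample["Tumor Status"].value_counts())
print(sample.groupby("Tumor Status")["radius"].mean())
```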


## Integration with Numpy

Numpy is a dependency of Pandas, so the two libraries work together closely. You can also use Numpy arrays to index data in a DataFrame.

In [30]:
# Example: shuffling the rows of a DataFrame with a Numpy permutation
perm = np.random.permutation(len(tumors))

# We're going to use the "iloc" indexing method here. More information
# about indexing is here.
#     http://pandas.pydata.org/pandas-docs/stable/indexing.html
# Because of time constraints, I don't want to dive too deep into it.
tumors = tumors.iloc[perm]

In [None]:
# Notice that the order of the index has changed
tumors.head(20)
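
After a shuffle, position-based and label-based indexing no longer agree, which is a common source of confusion. The sketch below uses a fixed permutation instead of a random one so the result is predictable; the frame and its `radius` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample with the default index 0, 1, 2
sample = pd.DataFrame({"radius": [17.99, 7.76, 20.57]})

# A fixed permutation standing in for np.random.permutation(len(sample))
perm = np.array([2, 0, 1])
shuffled = sample.iloc[perm]

# The index labels travel with their rows
print(shuffled.index.tolist())

# iloc selects by position, loc selects by index label
first_by_position = shuffled["radius"].iloc[0]
row_labeled_zero = shuffled.loc[0, "radius"]
print(first_by_position, row_labeled_zero)

# reset_index(drop=True) renumbers the rows 0, 1, 2 in their new order
renumbered = shuffled.reset_index(drop=True)
print(renumbered.index.tolist())
```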

## (Mostly) seamless integration with Scikit-Learn

The Scikit-Learn project fully acknowledges that Pandas is a powerful library for data analysis. Hence, you can pass DataFrames (and Pandas Series) directly into Scikit-Learn.

We'll talk more about modeling with Scikit-Learn tomorrow.

In [32]:
X = tumors.drop("Tumor Status", axis=1)
y = tumors["Tumor Status"]

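from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed from Scikit-Learn
from sklearn.model_selection import train_test_split

# Hold out part of the data (25% by default) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
predictions = LogisticRegression().fit(X_train, y_train).predict(X_test)

print("Accuracy of logistic regression: ", accuracy_score(y_test, predictions))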

Accuracy of logistic regression:  0.965034965035
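
If you don't have `breast.data` at hand, the same idea can be tried with the breast cancer dataset bundled with Scikit-Learn, which appears to be the same UCI data with the ID column already removed. This is a sketch under a few assumptions that are not part of the cell above: it loads the bundled copy with `load_breast_cancer(as_frame=True)`, fixes `random_state` so the split is repeatable, and adds a `StandardScaler` step before the model. Note that the bundled copy codes the label as 0 (malignant) and 1 (benign) rather than M and B.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# as_frame=True returns the features as a DataFrame and the target as a Series
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Fixed seed for a repeatable split; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Scale the features, then fit the classifier, directly on DataFrames
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Accuracy with scaling: ", accuracy)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training rows only, so nothing about the test rows leaks into the fit.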
