# Introduction

A crash course on data preprocessing using Pandas and Scikit-Learn.

For more information about Jupyter notebooks, see [here](http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/what_is_jupyter.html).

## More resources

* [Kaggle Tutorial for Python](https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii) (highly recommended)
* [Safari Online](https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html)

## Why Pandas?

A quick demo... (Make sure you've run
```
$ python fetch_uci.py
```
in `ADSA/tutorial/datasets` first.)

## Demo with UCI Breast Cancer Dataset (breast.data)

* Easy dataset to start off with
* Dataset contains only continuous variables, apart from one ID column and one label (M, B) column
    * The continuous variables are just statistics collected from a tumor's biopsy
    * More information can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names)
* Goal of the dataset is to classify whether a tumor is malignant (M) or benign (B)

In [None]:
import numpy as np

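# Try to load the data using numpy.
# This raises a ValueError: loadtxt defaults to dtype=float,
# and the label column holds strings ("M" / "B").
prefix = "../datasets/"
data = np.loadtxt(prefix + "breast.data", delimiter=",")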

In [None]:
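# dtype='O' stores every field as a generic Python object, so the load
# succeeds, but the array no longer has a native numeric dtype
data = np.loadtxt(prefix + "breast.data", delimiter=",", dtype='O')
data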

## What's happening here?

Good things about Numpy:
* Vectorization - compiled code runs faster than interpreted code
* Syntax is intuitive (for the most part)

Limitations of Numpy:
* Can't handle multiple datatypes in an array
    * Existing workarounds in NumPy (such as object arrays) often forgo the speed boosts the library gives
* Limited support for "structured data" (data like that in SQL tables)
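
The datatype limitation can be seen on a tiny example. This is only a sketch: the three values are made up for illustration and are not taken from `breast.data`.

```python
import numpy as np

# A homogeneous float array supports fast, vectorized arithmetic.
floats = np.array([1.0, 2.5, 4.0])
print(floats.dtype)
print((floats * 2).sum())

# Mixing a string with numbers makes NumPy pick one common dtype:
# here everything is converted to text.
mixed = np.array(["M", 17.99, 10.38])
print(mixed.dtype.kind)  # "U" means a Unicode string dtype

# Asking for dtype=object keeps the original Python values, but the
# array is now a container of generic objects rather than packed numbers.
as_objects = np.array(["M", 17.99, 10.38], dtype=object)
print(as_objects.dtype)
print(type(as_objects[1]))
```

Either way, the numeric columns stop being a plain numeric array, which is where most of NumPy's speed comes from.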
    

## Why Pandas? (Part II)

* Dataframes allow for multiple datatypes, like SQL tables
    * Idea was taken from R
* Both rows (through the "index") and columns can be looked up by keys (e.g. strings)
    * Similar to a SQL table (column manipulation, at least)
    * Indexing rows and columns by keys is not very conventional in NumPy (but it can be done)
* Supports vectorization (We'll talk about this later)
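
As a rough illustration of the points above, here is a minimal sketch with a hand-made DataFrame. The column names, row labels, and values are hypothetical and only serve the example.

```python
import pandas as pd

# One text column and one numeric column living in the same table.
toy = pd.DataFrame(
    {"status": ["M", "B", "M"], "radius": [17.99, 13.54, 20.57]},
    index=["a", "b", "c"],  # string keys for the rows
)

# Each column keeps its own datatype.
print(toy.dtypes)

# Look up a single value by row key and column key.
print(toy.loc["b", "radius"])

# Arithmetic on a numeric column is vectorized, as in NumPy.
doubled = toy["radius"] * 2
print(doubled)
```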

In [None]:
import pandas as pd

df = pd.read_csv(prefix + "breast.data", sep=",")

In [None]:
df.head()

In [None]:
# read_csv assumed the first row was a header; pass "header=None" to disable this
df = pd.read_csv(prefix + "breast.data", sep=",", header=None)

In [None]:
df.head()
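
The effect of `header=None` is easy to check on an in-memory CSV. The sketch below uses `io.StringIO` and two made-up rows in place of the real file.

```python
import io

import pandas as pd

# Two data rows, no header line (values are invented for the example).
raw = "1,M,17.99\n2,B,13.54\n"

# Default behaviour: the first row is consumed as column names.
with_header = pd.read_csv(io.StringIO(raw))
print(with_header.shape)
print(list(with_header.columns))

# header=None keeps every row as data and numbers the columns 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(no_header.shape)
print(list(no_header.columns))
```

With the default, one sample silently disappears into the column names, which is why the numeric column labels (`0`, `1`, ...) are renamed explicitly in this notebook.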

In [None]:
# I'd like to rename the columns, for readability
# R's syntax for renaming columns is weird: 
#     http://www.cookbook-r.com/Manipulating_data/Renaming_columns_in_a_data_frame/
df.rename(columns={0: "id", 1: "Tumor Status"}, inplace=True)

In [None]:
df.head()

In [None]:
# Drop the ids of the samples
# "axis = 1" means that "id" exists as a column, not a row
# See this post for more information about the definition of "axis"
#     http://stackoverflow.com/q/25773245/2014591
tumors = df.drop("id", axis=1)

In [None]:
tumors.head()
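
To make the meaning of `axis` concrete, here is a small sketch on a hypothetical two-row table (the row labels `r0` and `r1` and the values are made up).

```python
import pandas as pd

toy = pd.DataFrame(
    {"id": [101, 102], "Tumor Status": ["M", "B"]},
    index=["r0", "r1"],
)

# axis=1: the label "id" is searched among the columns.
without_col = toy.drop("id", axis=1)
print(list(without_col.columns))

# axis=0: the label "r0" is searched among the rows (the index).
without_row = toy.drop("r0", axis=0)
print(list(without_row.index))

# The columns= keyword expresses the same column drop without an axis number.
same = toy.drop(columns="id")
print(same.equals(without_col))
```

Note that `drop` returns a new DataFrame each time; `toy` itself is left unchanged.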

In [None]:
# Let's take a look at the tumors that are malignant (M)
# Syntax is similar to numpy.
malignant = tumors[tumors["Tumor Status"] == "M"]
benign = tumors[tumors["Tumor Status"] == "B"]

In [None]:
malignant.head()
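
The filtering above works in two steps: the comparison builds a boolean Series, and that Series is then used as a mask. A minimal sketch on made-up data (the `radius` column and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Tumor Status": ["M", "B", "B", "M"], "radius": [17.99, 13.54, 12.45, 20.57]}
)

# Step 1: the comparison is evaluated for every row at once.
is_malignant = toy["Tumor Status"] == "M"
print(is_malignant.tolist())

# Step 2: indexing with the mask keeps only the rows marked True.
malignant_toy = toy[is_malignant]
print(malignant_toy)

# Masks can be combined with & (and) or | (or); each condition
# needs its own parentheses.
large_malignant = toy[is_malignant & (toy["radius"] > 18)]
print(large_malignant)
```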

## Integration with Numpy

You can also use NumPy arrays to index a DataFrame.

In [None]:
# Example with shuffling an array
perm = np.random.permutation(len(tumors))

# We're going to use the "iloc" indexing to index here. More information
# about indexing is here.
#     http://pandas.pydata.org/pandas-docs/stable/indexing.html
# Because of time constraints, I don't want to dive too deep into it.
tumors = tumors.iloc[perm]

In [None]:
# Notice that the order of the index has changed
tumors.head(20)
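
A short sketch of what `iloc` does with a permutation, using a made-up five-row table and a seeded random generator so the shuffle is repeatable. The seed and the `value` column are assumptions for the example only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# A random ordering of the positions 0..4.
rng = np.random.default_rng(0)
perm = rng.permutation(len(toy))

# iloc selects by position, so the rows come back in the order given by perm.
shuffled = toy.iloc[perm]

# The index labels travel with their rows, which is why the index looks shuffled.
print(shuffled.index.tolist() == perm.tolist())

# iloc[0] is whatever row is first now; loc[0] is still the row labelled 0.
first_by_position = shuffled.iloc[0]
row_labelled_0 = shuffled.loc[0]
print(row_labelled_0["value"])

# If the old labels are not needed, reset_index renumbers the rows from 0.
renumbered = shuffled.reset_index(drop=True)
print(renumbered.index.tolist())
```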

## Seamless\* integration with Scikit-Learn

The Scikit-Learn project fully acknowledges that Pandas is a powerful library for data analysis. Hence, you can pass DataFrames (and Pandas Series) directly into Scikit-Learn.

We'll talk about how to model more with Scikit-Learn tomorrow.

<sub>\* There are some functions from Scikit-Learn that I would argue do *not* integrate neatly with Pandas dataframes. But these are mostly corner cases.</sub>

In [None]:
X = tumors.drop("Tumor Status", axis=1)
y = tumors["Tumor Status"]

from sklearn.svm import SVC  # Support vector machine that performs classification
from sklearn.metrics import accuracy_score
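# train_test_split lives in sklearn.model_selection
# (older releases exposed it as sklearn.cross_validation)
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing, then fit an SVM and predict on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y)
predictions = SVC().fit(X_train, y_train).predict(X_test)

print("Accuracy of SVM: ", accuracy_score(y_test, predictions))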
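
One example of the kind of corner case mentioned in the footnote, as I understand it: by default, Scikit-Learn transformers accept a DataFrame but hand back a plain NumPy array, so the column names are lost. The sketch below uses `StandardScaler` on made-up values; it is an illustration, not necessarily the case the footnote had in mind.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table (values invented for the example).
toy = pd.DataFrame(
    {"radius": [17.99, 13.54, 20.57], "texture": [10.38, 14.36, 17.77]}
)

# The scaler accepts the DataFrame, but with default settings
# the result is a NumPy array without column labels.
scaled = StandardScaler().fit_transform(toy)
print(type(scaled))

# Wrapping the result restores the original labels.
scaled_df = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)
print(scaled_df)
```

Recent Scikit-Learn releases also provide a `set_output` option on transformers that can return DataFrames directly, if your installed version supports it.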