# Data Analysis & Prediction
David Gasquez & Diego Hueltes

## Tools
Useful tools for data analysis
- [Overview](#overview)
- [Jupyter](#jupyter)
- [Pandas](#pandas)
- [Seaborn](#seaborn)
- [bcolz / bquery](#bcolz)
- [Other tools](#other-tools)

## Data pipeline
Modeling workflow
- [Preprocessing](#preprocesing)
- [Analysis](#analysis)
- [Modeling](#modeling)
- [Model deployment](#deployment)

## Models
Different kinds of models
- [Linear](#linear)
- [Trees](#trees)

## Demo
Demo time!
- [Demo](#demo)

# Tools

<a id='overview'></a>
## Overview

<img src="https://avatars3.githubusercontent.com/u/7388996" width="400">
<img src="http://www.numfocus.org/uploads/6/0/6/9/60696727/6893890_orig.png" width="600">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Created_with_Matplotlib-logo.svg/2000px-Created_with_Matplotlib-logo.svg.png" width="300">
<img src="https://github.com/Blosc/bcolz/raw/master/doc/bcolz.png" width="300">
<img src="https://github.com/visualfabriq/bquery/raw/master/bquery.png" width="300">
<img src="https://avatars2.githubusercontent.com/u/365630?v=3&s=400" width="400">

<a id='jupyter'></a>
# Jupyter

In [None]:
1 + 1

In [None]:
two = 1 + 1

In [None]:
print(list(map(lambda x: "Happy Birthday to " + ("you" if x != two else "Me"), range(3))))

In [None]:
# Floor division (//) keeps every value an int, which bytearray() requires
p = lambda x: ( -13214 * x**11 + 956318 * x**10 - 30516585 * x**9 + 564961485 * x**8
                - 6717043212 * x**7 + 53614486464 * x**6 - 291627605005 * x**5
                + 1074222731065 * x**4 - 2606048429424 * x**3 + 3927289106268 * x**2
                - 3265905357360 * x + 1116073728000 ) // 19958400

print(bytearray(map(p, range(1, 13))))

In [None]:
def f(x):
    return (x * 5) * (x - 5) / (x + 5) + x + x * 42

In [None]:
%time my_list = [f(x) for x in range(5000000)]

In [None]:
import multiprocessing
%time my_list = list(multiprocessing.Pool(processes=4).map(f, range(5000000)))

In [None]:
!ls

In [None]:
!while true; do head -c200 /dev/urandom | od -An -w40 -x | grep -E --color "([[:alpha:]][[:digit:]]){2}"; sleep 2; done

<a id='pandas'></a>
# Pandas

In [None]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # pandas no longer provides the 'display.mpl_style' option
plt.rcParams['figure.figsize'] = (15, 5)
pd.set_option('display.width', 5000) 
pd.set_option('display.max_columns', 60)

In [None]:
complaints = pd.read_csv('data/requests.csv')

In [None]:
print(complaints)

In [None]:
complaints

## Why is columnar faster?
<img src="img/columnar.png" width="1000">
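One way to read the figure: an aggregate over a single field only needs that field's values when the table is stored column by column, while a row layout has to walk every record. The sketch below uses plain Python containers and made-up data; `rows`, `columns`, and the field names are illustrative assumptions, not part of the complaints dataset.

```python
# Hypothetical toy table, stored two ways.
rows = [
    {"id": i, "borough": "BROOKLYN" if i % 2 else "QUEENS", "count": i * 3}
    for i in range(1000)
]

# Columnar layout: one contiguous sequence per field.
columns = {
    "id": [r["id"] for r in rows],
    "borough": [r["borough"] for r in rows],
    "count": [r["count"] for r in rows],
}

# Row layout: every record is visited and one field is picked out of each.
total_from_rows = sum(r["count"] for r in rows)

# Columnar layout: only the "count" sequence is read.
total_from_columns = sum(columns["count"])

print(total_from_rows == total_from_columns)
```

Both layouts give the same answer; the difference is how much unrelated data has to be touched to get it.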

In [None]:
complaints['Complaint Type']

In [None]:
complaints[:5]

In [None]:
complaints['Complaint Type'][:5]

In [None]:
complaints[:5]['Complaint Type']

In [None]:
complaints[['Complaint Type', 'Borough']][:10]

In [None]:
complaints['Complaint Type'].value_counts()

In [None]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]

In [None]:
complaint_counts[:10].plot(kind='bar')

In [None]:
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[:3]

In [None]:
mask = complaints['Complaint Type'] == "Noise - Street/Sidewalk"

In [None]:
mask

In [None]:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]

In [None]:
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()

In [None]:
noise_complaint_counts / complaint_counts

In [None]:
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')

In [None]:
bikes = pd.read_csv('data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')

In [None]:
bikes

In [None]:
bikes['Berri 1'].plot()

In [None]:
berri_bikes = bikes[['Berri 1']].copy()
berri_bikes[:5]

In [None]:
berri_bikes.index.weekday

In [None]:
berri_bikes['weekday'] = berri_bikes.index.weekday
berri_bikes

In [None]:
weekday_counts = berri_bikes.groupby(['weekday'])['Berri 1'].sum()
weekday_counts

In [None]:
weekday_mean = berri_bikes.groupby(['weekday'])['Berri 1'].mean()
weekday_mean

In [None]:
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts

In [None]:
weekday_counts.plot(kind='bar')

In [None]:
weekday_counts = berri_bikes.groupby(['weekday'], as_index=False)['Berri 1'].sum()
weekday_counts

In [None]:
mean_bikes = weekday_counts['Berri 1'].mean()
weekday_counts['level'] = weekday_counts['Berri 1'].apply(lambda x: 'Greater' if x >= mean_bikes
                                                                  else 'Lower')
weekday_counts

<a id='bcolz'></a>
# bcolz / bquery

<img src="http://www.rtcmagazine.com/files/images/2702/EditorsReport_fig1_large.jpg">

> bcolz provides columnar, chunked data containers that can be compressed either in-memory or on-disk.

http://nbviewer.jupyter.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb
https://github.com/visualfabriq/bquery/blob/master/bquery/benchmarks/bench_groupby.ipynb
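The idea behind a "columnar, chunked, compressed" container can be sketched without bcolz itself. The example below is not bcolz's implementation: it swaps in `zlib` from the standard library where bcolz uses Blosc, and `CHUNK_LEN`, `compress_column`, and `iter_chunks` are hypothetical names chosen for illustration.

```python
import zlib

import numpy as np

CHUNK_LEN = 4096  # hypothetical chunk size, in elements


def compress_column(values, chunk_len=CHUNK_LEN):
    """Split a 1-D array into fixed-size chunks and compress each one."""
    values = np.ascontiguousarray(values)
    chunks = [
        zlib.compress(values[start:start + chunk_len].tobytes())
        for start in range(0, len(values), chunk_len)
    ]
    return chunks, values.dtype


def iter_chunks(chunks, dtype):
    """Decompress one chunk at a time, so only one chunk is expanded at once."""
    for blob in chunks:
        yield np.frombuffer(zlib.decompress(blob), dtype=dtype)


# Made-up column with a lot of repetition, so it compresses well.
column = np.arange(100000, dtype=np.int64) % 50
chunks, dtype = compress_column(column)

# Evaluate a filter chunk by chunk, similar in spirit to a `c1 > 35` query.
matches = sum(int((chunk > 35).sum()) for chunk in iter_chunks(chunks, dtype))

compressed_bytes = sum(len(blob) for blob in chunks)
print(matches, column.nbytes, compressed_bytes)
```

Because each chunk is compressed independently, a query can stream through the column without ever holding the fully decompressed data in memory.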

In [None]:
bikes = bikes.reset_index()
bikes = bikes.rename(columns={column: 'c{}'.format(i) 
                              for i, column in enumerate(bikes.columns)})

In [None]:
from bcolz import ctable
import bcolz
ct = ctable.fromdataframe(bikes)
ct

In [None]:
[row.c0 for row in ct.where('c1 > 35')]

<a id='seaborn'></a>
# Seaborn

In [None]:
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")

In [None]:
rs = np.random.RandomState(11)
t = rs.gamma(2, size=1000)
s = -.5 * t + rs.normal(size=1000)

In [None]:
plt.scatter(t, s)
plt.show()

In [None]:
sns.jointplot(x=t, y=s, height=10)
plt.show()

In [None]:
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

# Models

<a id='linear'></a>
## Linear models and SVM

### OLS, WLS

<img src="http://strijov.com/sources/img/demo_GLM_2.png">

### SVM

<img src="http://1.bp.blogspot.com/-CD6nja2DNDY/VgTft5YhWiI/AAAAAAAADEo/W7eTpexZ0fI/s1600/svm-predicted-classification-3-ring-data-resized-600.png">
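As a rough illustration of the OLS/WLS difference, the sketch below fits both with NumPy on synthetic data whose noise grows with `x`. The data, `true_beta`, and the choice of weights (inverse noise variance) are assumptions made for the example. WLS is computed by scaling each row by the square root of its weight and then solving an ordinary least-squares problem.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data: a straight line whose noise grows with x (heteroscedastic).
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])  # intercept column plus one feature
true_beta = np.array([2.0, 0.5])
noise_scale = 0.1 + 0.2 * x
y = X @ true_beta + rng.normal(size=x.size) * noise_scale

# OLS: minimise the sum of squared residuals, all points weighted equally.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS: minimise the weighted sum of squared residuals. Scaling each row by
# sqrt(weight) turns it into an ordinary least-squares problem.
weights = 1.0 / noise_scale ** 2
sqrt_w = np.sqrt(weights)
beta_wls, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)

print("OLS:", beta_ols)
print("WLS:", beta_wls)
```

Down-weighting the noisy points lets the low-noise region dominate the fit, which is the usual motivation for WLS when the error variance is not constant.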

# Data pipeline

<img src="img/data_pipeline.png" width="600">

<a id='modeling'></a>
## Modeling

In [None]:
from sklearn import datasets, ensemble
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np


def mean_absolute_percentage_error(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined if y_true contains zeros."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

dataset = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

In [None]:
regressor = ensemble.ExtraTreesRegressor()
regressor.fit(X_train, y_train)

In [None]:
predictions = regressor.predict(X_test)
print("R²:", r2_score(y_test, predictions))
print("MAPE:", mean_absolute_percentage_error(y_test, predictions))
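To make the MAPE figure easier to interpret, here is a small worked example with made-up numbers. The arrays are illustrative assumptions; the function mirrors the notebook's definition, with an added conversion to float arrays so that plain lists work too.

```python
import numpy as np


def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |error| / |true value|, expressed in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


y_true = [100, 200, 400]
y_pred = [110, 180, 400]

# Per-sample relative errors: 10/100 = 0.10, 20/200 = 0.10, 0/400 = 0.00
mape = mean_absolute_percentage_error(y_true, y_pred)
print(round(mape, 2))  # 6.67
```

Since each error is divided by the true value, the metric is undefined when a target equals zero and is inflated by targets close to zero.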

In [None]:
clf = ensemble.ExtraTreesRegressor()

regressor = GridSearchCV(
    clf,
    {
        'max_depth': [None, 20, 25, 30],
        'max_features': [None, 'sqrt', 10],
        'n_estimators': [350, 400],
    },
    cv=10,
    n_jobs=-1,
    verbose=1,
    scoring='r2'
)

regressor.fit(X_train, y_train)
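After a grid search has been fitted, the chosen hyperparameters and cross-validated score are available as attributes on the search object. The sketch below assumes a recent scikit-learn (with `sklearn.model_selection`) and uses a deliberately tiny grid, `cv=3`, and fixed `random_state` values so that it runs quickly; these settings are illustrative, not the ones used in the cell above.

```python
from sklearn import datasets, ensemble
from sklearn.model_selection import GridSearchCV, train_test_split

dataset = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0
)

# Small illustrative grid: 2 x 2 = 4 candidate settings, 3 folds each.
search = GridSearchCV(
    ensemble.ExtraTreesRegressor(random_state=0),
    {"max_depth": [None, 5], "n_estimators": [10, 20]},
    cv=3,
    scoring="r2",
)
search.fit(X_train, y_train)

print(search.best_params_)            # best combination found on the grid
print(search.best_score_)             # its mean cross-validated R²
print(search.score(X_test, y_test))   # R² of the refitted best model on held-out data
```

By default the search refits the best configuration on the whole training set, so the fitted search object can be used for prediction directly.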

<a id='deployment'></a>
## Model deployment

In [None]:
import joblib

In [None]:
joblib.dump(regressor, 'test.pkl')
loaded_regressor = joblib.load('test.pkl')

predictions = loaded_regressor.predict(X_test)
print("R²:", r2_score(y_test, predictions))
print("MAPE:", mean_absolute_percentage_error(y_test, predictions))