# Everything Data (IGERT Bootcamp, Day 3)

*Instructor:* [Luca Foschini](http://www.lucafoschini.com) (email: luca@evidation.com) (twitter: @calimagna)

*Format:* Lecture and hands-on

## Goals
- Learn how to perform basic data manipulation with Python
- See all the things that Python can do
- Learn what makes your code run slowly
- Do you really have big data? 

## Data Ingestion, Wrangling, ETL

  - 80% of Data Science is data wrangling. 
  - Python's library ecosystem is the first reason to use it!
  - Pandas: if you learn one thing today, learn this!


### Everything has a Python API

Most popular internet services have a Python client library, either official or community-maintained:

Examples: 
 - Weather : https://github.com/csparpa/pyowm
 - Twitter: https://code.google.com/p/python-twitter/ 
 - Fitbit: https://github.com/orcasgit/python-fitbit
 
### Many Domain Specific Libraries

 - Natural Language Processing: http://www.nltk.org/ [Run the tutorial], see also : http://fbkarsdorp.github.io/python-course/
 - Graphs:  http://networkx.readthedocs.io/en/networkx-1.11/examples/drawing/ego_graph.html
 - Machine Learning: http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html [Run the example]
 
### Exotic:
 - Deep Learning: https://keras.io/
 - Survival analysis: https://github.com/CamDavidsonPilon/lifelines
 - Bayesian inference and MCMC: http://pymcmc.readthedocs.org/en/latest/
 

In [5]:
# Example 1:
# do something fun with the weather API

## Data Wrangling with Python and Pandas (tutorial)

Introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html

Tutorial on data wrangling:

https://github.com/jvns/pandas-cookbook

In [None]:
# Run some exploration on tutorial
%matplotlib inline
import pandas as pd
import matplotlib
matplotlib.style.use('ggplot')
# Montreal weather
weather_url = "https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/weather_2012.csv"

# parse_dates=True parses the index column ('Date/Time') as datetimes
weather_2012_final = pd.read_csv(weather_url, parse_dates=True, index_col='Date/Time')
weather_2012_final['Temp (C)'].plot(figsize=(15, 6))
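
A minimal sketch of two common wrangling steps, filtering rows and aggregating by month. It assumes a tiny in-memory table with made-up values that only mimics the column names of `weather_2012.csv`, so it runs without network access; the names `csv_text`, `snowy` and `monthly_mean` are illustrative.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for weather_2012.csv (values are made up).
csv_text = """Date/Time,Temp (C),Weather
2012-01-01 00:00:00,-1.8,Fog
2012-01-01 01:00:00,-1.4,Fog
2012-01-02 00:00:00,2.0,Snow
2012-02-01 00:00:00,-6.0,Clear
2012-02-01 01:00:00,-4.0,Snow
"""

# Same read_csv pattern as above, but reading from an in-memory buffer.
weather = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date/Time")

# Boolean mask: keep only the rows whose description mentions snow.
snowy = weather[weather["Weather"].str.contains("Snow")]

# Group by the calendar month of the datetime index and average the temperature.
monthly_mean = weather["Temp (C)"].groupby(weather.index.month).mean()

print(snowy)
print(monthly_mean)
```

The same two lines should work on the full `weather_2012_final` frame, since it has the same column names.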

## Why is my code slow?

  - Look under the hood: memory hierarchies.
  - Python is magic, and magic isn't free: how built-in types are implemented, and efficiency considerations.
  - Profiling and monitoring.
  - If all else fails: go parallel.
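
A small sketch of the second and third points, using only the standard library. It times a membership test on a `list` (linear scan) against a `set` (hash lookup), then profiles a toy function with `cProfile`. The function `count_missing` and the sizes used are illustrative assumptions, not part of the original material.

```python
import cProfile
import io
import pstats
import timeit

items_list = list(range(100_000))
items_set = set(items_list)
missing = -1  # worst case for the list: every element gets inspected

# Same question ("is it in there?"), very different cost per built-in type.
list_time = timeit.timeit(lambda: missing in items_list, number=100)
set_time = timeit.timeit(lambda: missing in items_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")


def count_missing(queries, container):
    """Count how many queries are absent from the container."""
    return sum(1 for q in queries if q not in container)


# Profile one call and collect the report in a string buffer.
profiler = cProfile.Profile()
profiler.enable()
n_missing = count_missing(range(-50, 50), items_list)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

In a notebook, the `%timeit` and `%prun` magics give the same information with less typing.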

### Example of vectorization and timing

http://nbviewer.jupyter.org/github/rossant/ipython-minibook/blob/master/chapter3/301-vector-computations.ipynb


In [None]:
# Run the example above

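def closest(position, positions):
    """Return the index of the point in `positions` nearest to `position`."""
    x0, y0 = position
    dbest, ibest = None, None
    for i, (x, y) in enumerate(positions):
        # Squared Euclidean distance; the square root is not needed for ranking
        d = (x - x0) ** 2 + (y - y0) ** 2
        if dbest is None or d < dbest:
            dbest, ibest = d, i
    return ibest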
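
A possible vectorized counterpart to the loop above, sketched with NumPy. `closest_loop` is a copy of the function above so the block runs on its own; `closest_vectorized` and the random test data are assumptions added for illustration.

```python
import numpy as np


def closest_loop(position, positions):
    """Pure-Python version: one iteration per point."""
    x0, y0 = position
    dbest, ibest = None, None
    for i, (x, y) in enumerate(positions):
        d = (x - x0) ** 2 + (y - y0) ** 2
        if dbest is None or d < dbest:
            dbest, ibest = d, i
    return ibest


def closest_vectorized(position, positions):
    """NumPy version: all squared distances computed in one array expression."""
    positions = np.asarray(positions, dtype=float)
    deltas = positions - np.asarray(position, dtype=float)  # broadcast over rows
    squared_distances = (deltas ** 2).sum(axis=1)
    return int(np.argmin(squared_distances))


rng = np.random.default_rng(0)
positions = rng.random((10_000, 2))  # 10,000 random points in the unit square
position = (0.5, 0.5)

print(closest_loop(position, positions), closest_vectorized(position, positions))
```

To compare speed in a notebook, run `%timeit closest_loop(position, positions)` and `%timeit closest_vectorized(position, positions)` in separate cells.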

### One benchmark a day
Goldmine: https://github.com/rasbt/One-Python-benchmark-per-day/tree/master/

Try: 
    
- [6 different ways for counting elements using a dictionary](http://nbviewer.jupyter.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day3_dictionary_counting.ipynb)

- [Python vs Cython vs Numba](http://nbviewer.jupyter.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day4_python_cython_numba.ipynb)
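
As a warm-up for the first benchmark, here is a sketch of three common ways to count elements with a dictionary. It is not taken from the linked notebook; the function names and the sample sentence are illustrative.

```python
import timeit
from collections import Counter, defaultdict

words = "the quick brown fox jumps over the lazy dog the end".split()


def count_with_get(items):
    """Plain dict, using .get() to supply a default of 0."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def count_with_defaultdict(items):
    """defaultdict(int) creates missing keys with value 0 on first access."""
    counts = defaultdict(int)
    for item in items:
        counts[item] += 1
    return dict(counts)


def count_with_counter(items):
    """Counter does the whole loop internally."""
    return dict(Counter(items))


counters = (count_with_get, count_with_defaultdict, count_with_counter)
results = [f(words) for f in counters]

# Rough timing on a longer input; absolute numbers depend on the machine.
long_input = words * 1_000
for f in counters:
    seconds = timeit.timeit(lambda: f(long_input), number=20)
    print(f"{f.__name__}: {seconds:.4f}s")
```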

## Memory, cores, I/O
  - [Latency](https://gist.github.com/jboner/2841832): Register, Cache, RAM, Disk (SSD/HDD), network
  - Out of core vs distributed
  - Embarrassingly parallel problems (shell/python parallel)
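
One way to picture out-of-core processing: stream the data one record at a time and keep only a running summary in memory. The sketch below uses the standard `csv` module on an in-memory stand-in for a large file; `streaming_mean` and the sample values are assumptions for illustration.

```python
import csv
import io


def streaming_mean(lines, column):
    """Mean of one CSV column, holding a single row in memory at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for row in reader:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Stand-in for a file too large to load; any iterable of lines works,
# including an open file handle.
fake_file = io.StringIO(
    "Date/Time,Temp (C)\n"
    "2012-01-01,-2.0\n"
    "2012-01-02,0.0\n"
    "2012-01-03,5.0\n"
)
mean_temp = streaming_mean(fake_file, "Temp (C)")
print(mean_temp)
```

With pandas, `pd.read_csv(..., chunksize=N)` offers a similar pattern by yielding the file as a sequence of smaller DataFrames.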
  
### Scale:
- Multiprocessing: http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html [Run the tutorial]
- Parallel: http://nbviewer.ipython.org/gist/ogrisel/5115540/Model%20Selection%20for%20the%20Nystroem%20Method.ipynb
- Big Data (Spark and BDAS): https://spark.apache.org/examples.html
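
A sketch of an embarrassingly parallel job with the standard library: count primes in a range by splitting it into independent chunks and giving each chunk to a worker process. The helper names (`is_prime`, `count_primes`, `split_range`, `count_primes_parallel`) and the choice of problem are illustrative assumptions, not taken from the linked tutorials.

```python
from concurrent.futures import ProcessPoolExecutor


def is_prime(n):
    """Trial division; deliberately simple and CPU-bound."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(bounds):
    """Count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def split_range(lo, hi, n_chunks):
    """Split [lo, hi) into n_chunks contiguous, nearly equal sub-ranges."""
    step, extra = divmod(hi - lo, n_chunks)
    chunks, start = [], lo
    for i in range(n_chunks):
        end = start + step + (1 if i < extra else 0)
        chunks.append((start, end))
        start = end
    return chunks


def count_primes_parallel(lo, hi, n_workers=4):
    """Each chunk is independent, so workers never need to communicate."""
    chunks = split_range(lo, hi, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module.
    print(count_primes_parallel(0, 200_000))
```

Inside a notebook on platforms that spawn rather than fork worker processes, the worker function may need to live in an importable `.py` module instead of a cell.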


## How to deal with big data?

  - Be smart: sampling, approximation algorithms, divide-and-conquer
  - Be rich: [rent-a-cloud](https://aws.amazon.com/ec2/pricing/), [Digital Ocean](https://www.digitalocean.com/), [Cloud9](https://c9.io/pricing)
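
An example of the "be smart" route: reservoir sampling (Algorithm R) draws a uniform random sample of `k` items from a stream of unknown length in a single pass, using memory proportional to `k` only. The function name and the `rng` parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform sample of k items from an iterable, in one pass and O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed lazily, so it never has to fit in memory.
sample = reservoir_sample(iter(range(1_000_000)), 10, rng=random.Random(0))
print(sample)
```

A statistic computed on such a sample (a mean, a histogram) is often a good enough approximation to decide whether the full dataset is worth processing at all.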
  
## Data Exploration

- Work with text, networks, time series. 
  - Examples, miniprojects, resources.


  - [Interesting notebook gallery](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#introductory-tutorials). Pick one!
  - [Miniproject](..//Day02_EverythingData/notebooks/07%20-%20Miniproject.ipynb)

