In [None]:
from __future__ import division, print_function

# Suppress warnings caused by Seaborn's use of deprecated Matplotlib features
import warnings
warnings.filterwarnings("ignore")

<center>
<h1>The Python Data Science Stack</h1>
<br>
<h3>Dr. Florian Wilhelm</h3>
<h4>Senior Professional: Data Scientist @ CSC</h4>
</center>

# What is Python?

* multi-purpose
* focused on readability and productivity
* easy to learn
* object oriented
* interpreted
* strongly and dynamically typed
* cross platform

# Features

* indentation is part of the syntax 
* high level data types (tuples, lists, dictionaries, sets)
* Python Standard Library (Batteries included)
  * string services, regular expressions
  * mathematical modules
  * IO, file formats and data persistence
  * OS, threading, multiprocessing
  * networking, email, html, webbrowser
  * ...
* easily extensible with C/C++ (glue language)
* tons of external libraries
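
As a small taste of the "batteries included" idea, the sketch below combines three standard-library modules (`re`, `json` and `collections`) without installing anything. The sample text is made up for illustration.

```python
import json
import re
from collections import Counter

# Made-up sample text, used only for illustration
text = "Python is easy to learn. Python is easy to read."

# Regular expressions: extract all words in lower case
words = re.findall(r"[a-z]+", text.lower())

# High-level data type from the standard library: count each word
counts = Counter(words)

# Data persistence: serialize the counts to a JSON string and read them back
encoded = json.dumps(counts, sort_keys=True)
decoded = json.loads(encoded)

print(counts.most_common(2))
print(decoded["python"])
```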

# Why Python for Analytics?

Besides the features already mentioned, Python has:

* large communities for data science, analytics, visualisation etc.,
* many and well-established libraries,
* lots of examples and documentation,
* **huge** demand from the industry.

<img src="./pics/jobgraph.png">

<center>
<h1> Python 2 vs. 3</h1>
<img src="./pics/python-2-vs-3.jpg" width=80%><br style="clear:both"/>
Source: <a href=http://learntocodewith.me/programming/python/python-2-vs-python-3/>LearnToCodeWithMe</a>
</center>

<center>
<h1>Installation</h1>
</center>

### Linux & Mac 

<img src="./pics/linux.jpeg" align="left" width=13%><br style="clear:both"/>
*It is already installed!* Use [virtualenv](http://virtualenv.readthedocs.org/) and [pip](https://pip.pypa.io/) to set up isolated environments and install more packages. [Conda](http://conda.pydata.org/docs/) is an alternative.

### Windows 
<img src="./pics/anaconda.png" align="left" width=15%><br style="clear:both"/>

*A bit trickier!* Best use the [Anaconda distribution](https://www.continuum.io/downloads) from Continuum Analytics to install everything you need to get going.


# Primer on Python

Strongly and dynamically typed

In [None]:
x = 23
3*x

In [None]:
x = "Hello "
y = "World!"
print(x + y)

In [None]:
# Raises a TypeError: a str and an int are never combined implicitly
print(x + 1)
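
The following sketch spells out both halves of "strongly and dynamically typed". It assumes nothing beyond the built-in types: the failing addition is caught so the cell keeps running, and an explicit `str()` conversion shows the intended fix.

```python
x = "Hello "

# Strong typing: mixing str and int raises an error instead of guessing
try:
    result = x + 1
except TypeError as err:
    result = None
    print("TypeError:", err)

# An explicit conversion states the intent and works
greeting = x + str(1)
print(greeting)

# Dynamic typing: the same name can be rebound to a value of another type
x = 23
print(type(x).__name__)
```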

## Indentation matters!

In [None]:
x = 3

if x > 0:
    if x % 2 == 0:
        print("Positive, even number!")
    else:
        print("Positive, odd number!")
else:
    print("Non-positive number!")

In [None]:
def bmi(height, weight):
    return weight / height**2

print("The BMI is: {:.3}".format(bmi(1.85, 79)))

## Tuples

In [None]:
x = (1, 3, 5)
print(x)

In [None]:
x[2]

In [None]:
a, b, c = x
print(a + b + c)

## Lists

In [None]:
x = [1, 3, 5]
print(x)

In [None]:
x.append(7)
print(x)

In [None]:
del x[0]
print(x)

## Dictionaries

In [None]:
x = {'a': 1, 'b': 2, 'c': 3}

In [None]:
print(x['b'])

In [None]:
x['d'] = 4
print(x)

In [None]:
x['dispatch'] = lambda x: x + 1
x['dispatch'](1)

Powerful and easy-to-use data structures such as lists and dictionaries allow **declarative programming**.
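
One way to read "declarative" here is a dispatch table: the mapping from names to behaviour is declared as data in a dictionary instead of being spelled out in an `if`/`elif` chain. The operation names and the `calculate` helper below are hypothetical and only serve as an illustration.

```python
# Hypothetical operations, chosen only for illustration
operations = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def calculate(name, a, b):
    """Look up the operation by name and apply it to a and b."""
    try:
        operation = operations[name]
    except KeyError:
        raise ValueError("unknown operation: {}".format(name))
    return operation(a, b)


print(calculate("add", 2, 3))
print(calculate("mul", 4, 5))
```

Supporting a new operation then means adding one dictionary entry; `calculate` itself stays unchanged.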

## Loops and list comprehension

In [None]:
x = []
for i in range(5):
    x.append(i**2)
print(x)

In [None]:
# better
x = [i**2 for i in range(5)]
print(x)

Many more *high-level* concepts are available to express an algorithm as *naturally* as possible.
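
A few of these concepts, sketched on made-up data: `zip`, `enumerate`, dictionary comprehensions, generator expressions and filtered comprehensions. The names and ages are illustrative only.

```python
# Made-up data, used only for illustration
names = ["Ada", "Grace", "Linus"]
ages = [36, 45, 28]

# zip pairs up the elements of several sequences;
# a dictionary comprehension turns the pairs into a mapping
name_to_age = {name: age for name, age in zip(names, ages)}

# enumerate yields index and element together
numbered = ["{}: {}".format(i, name) for i, name in enumerate(names, start=1)]

# A generator expression feeds sum() without building a temporary list
total_age = sum(age for age in ages)

# A condition inside a comprehension filters elements
over_30 = [name for name, age in zip(names, ages) if age > 30]

print(name_to_age)
print(numbered)
print(total_age)
print(over_30)
```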

<center><h1>Python Data Science Stack</h1>
<br>
<img src="./pics/stack_empty.png" width=90%>
</center>

<center><h1>Python Data Science Stack</h1>
<br>
<img src="./pics/stack_full.png" width=90%><br style="clear:both"/>
</center>

<center><h1>Python Data Science Stack</h1></center>


* **NumPy** to work efficiently with multi-dimensional arrays and matrices. Includes some high-level mathematical operations.
* **SciPy** extends NumPy with additional modules (optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers etc.).
* **Pandas** builds upon NumPy and provides high-performance, easy-to-use data structures and data analysis tools.
* **Scikit-Learn** provides simple and efficient machine learning tools for data mining and data analysis.
* **matplotlib** provides 2D plotting capabilities. Additionally, use **Seaborn** for statistical plots.
* **IPython** is a powerful interactive shell and a kernel for *Jupyter*.
* **Jupyter** Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. 
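
To make the first and third bullets concrete, here is a minimal sketch of how NumPy and Pandas work together. It assumes both packages are installed; the height and weight values are made up.

```python
import numpy as np
import pandas as pd

# NumPy: vectorized arithmetic on whole arrays, no explicit loop
heights = np.array([1.85, 1.70, 1.62])
weights = np.array([79.0, 65.0, 58.0])
bmi = weights / heights**2

# Pandas: labeled columns on top of NumPy arrays
people = pd.DataFrame({"height": heights, "weight": weights, "bmi": bmi})

# Boolean indexing selects rows by condition
tall = people[people["height"] > 1.65]

print(people.round(1))
print(len(tall))
```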

<center><h1>Jupyter/IPython Notebook</h1>
<img src="./pics/notebook.png" width=60%><br style="clear:both"/>
</center>

<center><h1>Titanic: Analysis of a Disaster</h1>
<img src="./pics/titanic.jpg" width=60%><br style="clear:both"/>
Painting by <a href=https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic>Willy Stöwer</a>, source & more information: <a href=https://www.kaggle.com/c/titanic>Kaggle</a>
</center>

## Setting things up and reading in the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./input/train.csv')

In [None]:
# Drop an unnecessary column and rename some others for better readability
df = df.drop('PassengerId', axis=1)
df = df.rename(columns={'Survived': 'Alive', 'Pclass': 'Class', 'Embarked': 'Port'})
df['Name'] = df['Name'].str[:10]

In [None]:
df.head()

## Preprocessing

In [None]:
# We drop some hard to use columns and define 'Port', 'Sex' and 'Class' as categories
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df['Port'] = df['Port'].astype('category')
df['Sex'] = df['Sex'].astype('category')
df['Class'] = df['Class'].astype('category')
df.head()

In [None]:
df.shape

## Data cleansing

In [None]:
df.describe()

In [None]:
df[['Sex', 'Port']].describe()

In [None]:
# Fill in missing observations: mean age for 'Age', port 'S' for 'Port'
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)
df['Port'] = df['Port'].fillna('S')

In [None]:
df[['Sex', 'Port']].describe()
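
The same fill strategy can be tried out on a toy frame. This is a sketch under stated assumptions: the data is made up, and filling the categorical column with its most frequent value (the mode) is one common choice that is assumed here, not necessarily the reasoning behind the `'S'` used above.

```python
import numpy as np
import pandas as pd

# Made-up toy data with gaps, standing in for the Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 30.0],
    "Port": ["S", "C", None, "S", "S"],
})

# Count missing values per column before filling
missing_before = toy.isnull().sum()

# Numeric column: fill with the mean of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent value (the mode)
toy["Port"] = toy["Port"].fillna(toy["Port"].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

Counting missing values before and after is a cheap check that no gaps remain.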

## Some analysis plots

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 5))
ax.axes.set_xlim(0, 80)
g = sns.histplot(df['Age'], kde=True, stat="density", color="b", ax=ax)

In [None]:
# Draw a nested barplot to show survival for class and sex
g = sns.catplot(x="Class", y="Alive", hue="Sex", data=df, height=7, kind="bar", palette="muted")
g.set_ylabels("survival probability")
g.set_xlabels("passenger class")

## Fitting a simple predictive model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Define features and target variables
X = df.drop('Alive', axis=1)
y = df['Alive']
# Convert categories to integer values
X['Sex'] = X['Sex'].cat.codes
X['Port'] = X['Port'].cat.codes
# Convert to NumPy arrays
X = X.values
y = y.values
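
What `.cat.codes` does can be seen on a toy column (made-up values): each entry is replaced by the position of its category, and string categories are kept in sorted order. Missing values, if any, would be encoded as `-1`.

```python
import pandas as pd

# Made-up toy column, used only for illustration
sex = pd.Series(["male", "female", "female", "male"]).astype("category")

# Categories are stored once, in sorted order
categories = list(sex.cat.categories)

# Each value is replaced by the position of its category
codes = list(sex.cat.codes)

print(categories)
print(codes)
```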

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model = model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)
print("Accuracy: {:.1%}".format(np.mean(preds == y_test)))
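
An accuracy figure is easier to judge next to a baseline. The sketch below compares a model's accuracy with that of always predicting the most frequent class. The labels and predictions are made up and are not results from the Titanic model.

```python
import numpy as np

# Made-up labels and predictions, used only for illustration
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0])

# Accuracy of the model: share of correct predictions
accuracy = np.mean(y_pred == y_true)

# Baseline: always predict the most frequent class
majority_class = np.bincount(y_true).argmax()
baseline = np.mean(y_true == majority_class)

print("Accuracy: {:.1%}".format(accuracy))
print("Baseline: {:.1%}".format(baseline))
```

A model is only useful if it clearly beats this baseline.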

<center><h1>Questions?</h1><br>
<img src="./pics/light-bulb.jpg" align="center" width="40%"/></center>

# Credits

<br>
This presentation was inspired by:

* [Savarin's PyCon UK Tutorial](http://nbviewer.jupyter.org/github/savarin/pyconuk-introtutorial/blob/master/notebooks/Section%201-1%20-%20Filling-in%20Missing%20Values.ipynb)
* [Thomas Wiecki's Introduction to Data Analysis with Python](http://nbviewer.jupyter.org/github/twiecki/pydata_ninja/blob/master/PyData%20Ninja.ipynb)
* [Kaggle's Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/)