# The Python language - lecture 12

This lecture is based entirely on the workshop by [Mateusz Flieger and Jacek Rzeszutek](https://github.com/Nozdi/first-steps-with-pandas-workshop).

# Welcome to the 'First steps with pandas'!

After this workshop you can (hopefully) call yourselves Data Scientists!

## What is pandas?

> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [None]:
import platform
print('Python: ' + platform.python_version())

import numpy as np
print('numpy: ' + np.__version__)

import pandas as pd
print('pandas: ' + pd.__version__)

import scipy
print('scipy: ' + scipy.__version__)

import sklearn
print('scikit-learn: ' + sklearn.__version__)

import matplotlib
print('matplotlib: ' + matplotlib.__version__)

import flask
print('flask: ' + flask.__version__)

## Why use it?

### It has ready solutions for most data-related tasks
- rapid development
- code readability
- fewer mistakes/bugs

In [None]:
# In case of no Internet, use:
# pd.read_json('data/cached_Python.json') \

pd.read_json('http://stats.grok.se/json/en/201601/Python_(programming_language)') \
    .resample('1W') \
    .mean()['daily_views']

### It is reasonably fast

In [None]:
some_data = [ list(range(1,100)) for x in range(1,1000) ]
some_array = np.array(some_data).reshape(999, 99)
some_df = pd.DataFrame(some_data)


def standard_way(data):
    return [[x*2 for x in row] for row in data]

def numpy_way(array):
    return array * 2

def pandas_way(df):
    return df * 2

In [None]:
print(standard_way(some_data)[:2])
%timeit standard_way(some_data)

In [None]:
print(numpy_way(some_array))
%timeit numpy_way(some_array)

In [None]:
print(pandas_way(some_df).head())
%timeit pandas_way(some_df)
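
`%timeit` is an IPython magic, so the cells above only run inside a notebook or an IPython shell. A minimal sketch of the same comparison in a plain script, using the standard-library `timeit` module; the names `rows`, `arr`, `df` and the repeat count of 20 are illustrative choices, and the absolute timings depend on the machine:

```python
import timeit

import numpy as np
import pandas as pd

rows = [list(range(1, 100)) for _ in range(1, 1000)]
arr = np.array(rows)
df = pd.DataFrame(rows)

# Sanity check: all three approaches produce the same doubled values.
doubled_list = [[x * 2 for x in row] for row in rows]
same_results = (
    (arr * 2).tolist() == doubled_list
    and (df * 2).to_numpy().tolist() == doubled_list
)
print("same results:", same_results)

# Time 20 runs of each approach.
timings = {
    "list comprehension": timeit.timeit(
        lambda: [[x * 2 for x in row] for row in rows], number=20
    ),
    "numpy": timeit.timeit(lambda: arr * 2, number=20),
    "pandas": timeit.timeit(lambda: df * 2, number=20),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 20 runs")
```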

### It deals nicely with real data problems (e.g. missing data)

In [None]:
missing_data = pd.DataFrame([
    dict(name="Jacek", height=174),
    dict(name="Wiesiek", weight=81),
    dict(name="Lionel Messi", height=169, weight=67)
])
missing_data

In [None]:
# numeric_only=True skips the text column 'name', which has no mean
missing_data.fillna(missing_data.mean(numeric_only=True))

### It has a very cool name.

![caption](W12_files/pandas.jpg)

> https://c1.staticflickr.com/5/4058/4466498508_35a8172ac1_b.jpg

###  Library highlights

http://pandas.pydata.org/#library-highlights<br/>
http://pandas.pydata.org/pandas-docs/stable/api.html

## Data structures

### Series

> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

In [None]:
strengths = pd.Series([400, 200, 300, 400, 500])
strengths

In [None]:
names = pd.Series(["Batman", "Robin", "Spiderman", "Robocop", "Terminator"])
names
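
The "labeled" part of the definition is easy to miss when the labels are the default 0..n-1. A small sketch, assuming a hypothetical `labeled` Series indexed by hero name, showing lookup by label and by position:

```python
import pandas as pd

# Without an explicit index the labels are simply 0..n-1.
strengths = pd.Series([400, 200, 300, 400, 500])
print(strengths.index)

# The labels can be any hashable values, for example hero names.
labeled = pd.Series(
    [400, 200, 300],
    index=["Batman", "Robin", "Spiderman"],
)
by_label = labeled["Robin"]      # label-based lookup
by_position = labeled.iloc[1]    # position-based lookup of the same element
print(by_label, by_position)
```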

### DataFrame

> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

#### Creating

In [None]:
heroes = pd.DataFrame({
    'hero': names,
    'strength': strengths,
})
heroes

In [None]:
other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Konan")
])
other_heroes

In [None]:
another_heroes = pd.DataFrame([
    pd.Series(["Bolek", 10, 3], index=["hero", "strength", "cookies"]),
    pd.Series(["Lolek", 20, 0], index=["hero", "strength", "cookies"])
])
another_heroes

#### Meta data

In [None]:
another_heroes.columns

In [None]:
another_heroes.shape

In [None]:
another_heroes.info()

#### Selecting
```
[string] --> Series
[list of strings] --> DataFrame
```

In [None]:
another_heroes['cookies']

In [None]:
another_heroes.cookies

In [None]:
a = another_heroes[ ['hero', 'cookies'] ]
print(a)
a['hero'][0]
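
The rule above can be checked directly by inspecting the type and shape of each result. A short sketch on a small stand-in frame (`df` is an illustrative name, not one of the workshop variables):

```python
import pandas as pd

df = pd.DataFrame({
    "hero": ["Bolek", "Lolek"],
    "strength": [10, 20],
    "cookies": [3, 0],
})

single = df["cookies"]      # a string key gives a Series
subset = df[["cookies"]]    # a list of strings gives a DataFrame, even for one column

print(type(single).__name__, single.shape)
print(type(subset).__name__, subset.shape)
```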

#### Chaining (most operations on a DataFrame return a new DataFrame or Series)

In [None]:
another_heroes[['hero', 'cookies']][['cookies']]

In [None]:
another_heroes[['hero', 'cookies']][['cookies']]['cookies']

## I/O part I

### Reading popular formats / data sources

In [None]:
# Uncomment and press tab..
# pd.read_
# SQL, csv, hdf

In [None]:
pd.read_csv?

In [None]:
pd.read_json('W12_files/cached_python.json')

In [None]:
movies = pd.read_csv('W12_files/movies.csv')

In [None]:
len(movies.keys())

In [None]:
movies.keys()

In [None]:
movies.info()

In [None]:
len(movies.values)

## Filtering

In [None]:
movies

### Boolean indexing

In [None]:
movies['budget'] > 10000

In [None]:
movies[movies['budget'] < 1000]

In [None]:
heroes[heroes['strength'] > 400]
heroes['strength']

### Multiple conditions

In [None]:
# Raises ValueError: Python's chained comparison does not work element-wise on a Series
heroes[200 < heroes['strength'] < 400]

In [None]:
movies[
    (movies['budget'] > 1200) & 
    (movies['budget'] < 4000)
]

In [None]:
heroes[
    (heroes['strength'] <= 200) |
    (heroes['strength'] >= 400)
]
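
The chained comparison `200 < s < 400` fails because Python evaluates it as `(200 < s) and (s < 400)`, and `and` needs a single truth value for the whole Series. A minimal sketch of the failure and two equivalent working forms; it assumes pandas 1.3 or newer for the string form of `inclusive` in `Series.between`:

```python
import pandas as pd

heroes = pd.DataFrame({
    "hero": ["Batman", "Robin", "Spiderman", "Robocop", "Terminator"],
    "strength": [400, 200, 300, 400, 500],
})

# The chained form raises, because a Series has no single truth value.
try:
    heroes[200 < heroes["strength"] < 400]
    chained_failed = False
except ValueError as error:
    chained_failed = True
    print("chained comparison failed:", error)

# Element-wise AND of two boolean Series; the parentheses are required.
with_and = heroes[(heroes["strength"] > 200) & (heroes["strength"] < 400)]

# The same open interval expressed with Series.between.
with_between = heroes[heroes["strength"].between(200, 400, inclusive="neither")]

print(with_and)
```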

### Negation

`~` is a negation operator

In [None]:
~(heroes['strength'] == 400)

In [None]:
heroes['strength'] != 400

In [None]:
heroes[~(
    (heroes['strength'] <= 200) |
    (heroes['strength'] >= 400)
)]

### Filtering for rows containing one of many values (SQL's IN)

In [None]:
heroes[
    heroes['hero'].isin(['Batman', 'Robin'])
]

## I/O part O

### As numpy array

In [None]:
heroes.values

### As (list) of dicts

In [None]:
heroes.to_dict()

In [None]:
heroes.to_dict('records')

### As popular data format

In [None]:
print(heroes.to_json())

In [None]:
print(heroes.to_json(orient='records'))

In [None]:
print(heroes.to_csv())

In [None]:
print(heroes.to_csv(index=False))

In [None]:
heroes.to_csv('data/heroes.csv', index=False)

## New columns

In [None]:
heroes

### Creating new column

In [None]:
heroes['health'] = np.nan
heroes.head()

In [None]:
heroes['health'] = 100
heroes.head()

In [None]:
heroes['height'] = [180, 170, 175, 190, 185]
heroes

In [None]:
heroes['is_hungry'] = pd.Series([True, False, False, True, True])
heroes

### Vector operations

In [None]:
heroes['strength'] * 2

In [None]:
heroes['strength'] / heroes['height']

In [None]:
heroes['strength_per_cm'] = heroes['strength'] / heroes['height']
heroes

### Map, apply, applymap, str

In [None]:
pd.Series([1, 2, 3]).map(lambda x: x**3)

In [None]:
pd.Series(['Batman', 'Robin']).map(lambda x: x[:2])

In [None]:
pd.Series(['Batman', 'Robin']).str[:2]
# this runs faster - do it this way

In [None]:
pd.Series(['Batman', 'Robin']).str.lower()

In [None]:
pd.Series([
    ['Batman', 'Robin'],
    ['Robocop']
]).map(len)

In [None]:
heroes['code'] = heroes['hero'].map(lambda name: name[:2])
heroes

In [None]:
heroes['effective_strength'] = heroes.apply(
    lambda row: (not row['is_hungry']) * row['strength'],
    axis=1
)
heroes.head()

In [None]:
# Note: DataFrame.applymap is deprecated since pandas 2.1, where it is called DataFrame.map
heroes[['health', 'strength']] = heroes[['health', 'strength']].applymap(
    lambda x: x + 100
)
heroes

#### Cheatsheet

```
map: 1 => 1
apply: n => 1
applymap: n => n
```
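
The cheatsheet can be read as "how many values go in, how many come out". A small sketch on an illustrative two-row frame; because `DataFrame.applymap` was renamed to `DataFrame.map` in pandas 2.1, the example picks whichever method the installed version provides:

```python
import pandas as pd

df = pd.DataFrame({"strength": [400, 200], "health": [100, 100]})

# Series.map: one value in, one value out.
halved = df["strength"].map(lambda value: value // 2)

# DataFrame.apply with axis=1: a whole row (n values) in, one value out.
total = df.apply(lambda row: row["strength"] + row["health"], axis=1)

# Element-wise over the DataFrame: n values in, n values out.
elementwise = df.map if hasattr(df, "map") else df.applymap
boosted = elementwise(lambda value: value + 100)

print(halved.tolist(), total.tolist())
print(boosted)
```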

### Sorting and value counts (bonus skill)

In [None]:
heroes['strength'].value_counts()

In [None]:
heroes.sort_values('strength')

In [None]:
heroes.sort_values(
    ['is_hungry', 'code'],
    ascending=[False, True]
)

## Visualizing data

In [None]:
heroes

### Basic stats

In [None]:
heroes.describe()

### Plotting

In [None]:
%matplotlib inline

In [None]:
pd.Series([1, 2, 3]).plot()

In [None]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot()

In [None]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(kind='bar')

In [None]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(
    kind='bar',
    figsize=(5, 2)
)
# seaborn

In [None]:
pd.Series([1, 2, 3], index=['Batman', 'Robin', 'Rambo']).plot(kind='pie')

In [None]:
heroes.plot()

In [None]:
indexed_heroes = heroes.set_index('hero')
indexed_heroes

In [None]:
indexed_heroes.plot()

In [None]:
indexed_heroes.plot(kind='barh')

In [None]:
indexed_heroes.plot(kind='bar', subplots=True, figsize=(15, 15))

In [None]:
indexed_heroes[['height', 'strength']].plot(kind='bar')

In [None]:
heroes.plot(x='hero', y=['height', 'strength'], kind='bar')

In [None]:
# alternative to subplots
heroes.plot(
    x='hero',
    y=['height', 'strength'],
    kind='bar',
    secondary_y='strength',
    figsize=(10,8)
)

In [None]:
heroes.plot(
    x='hero',
    y=['height', 'strength'],
    kind='bar',
    secondary_y='strength',
    title='Super plot of super heroes',
    figsize=(10,4)
)

### Histogram

In [None]:
heroes.hist(figsize=(10, 10))

In [None]:
heroes.hist(
    figsize=(15, 7),
    bins=2
)
# seaborn

### DataFrames everywhere.. are easy to plot

In [None]:
heroes.describe()['strength'].plot(kind='bar')

## Aggregation

### Grouping

![caption](W12_files/split-apply-combine.jpg)

> https://www.safaribooksonline.com/library/view/learning-pandas/9781783985128/graphics/5128OS_09_01.jpg

In [None]:
movie_heroes = pd.DataFrame({
    'hero': ['Batman', 'Robin', 'Spiderman', 'Robocop', 'Lex Luthor', 'Dr Octopus'],
    'movie': ['Batman', 'Batman', 'Spiderman', 'Robocop', 'Spiderman', 'Spiderman'],
    'strength': [400, 100, 400, 560, 89, 300],
    'speed': [100, 10, 200, 1, 20, None],
})
movie_heroes = movie_heroes.set_index('hero')
movie_heroes

In [None]:
movie_heroes.groupby('movie')

In [None]:
list(movie_heroes.groupby('movie'))
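
Iterating over the grouped object makes the split-apply-combine picture concrete. The sketch below does the three steps by hand for the maximum strength per movie and compares the result with the built-in aggregation; `pieces`, `manual` and `builtin` are illustrative names:

```python
import pandas as pd

movie_heroes = pd.DataFrame({
    "hero": ["Batman", "Robin", "Spiderman", "Robocop", "Lex Luthor", "Dr Octopus"],
    "movie": ["Batman", "Batman", "Spiderman", "Robocop", "Spiderman", "Spiderman"],
    "strength": [400, 100, 400, 560, 89, 300],
})

pieces = {}
# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs.
for movie, group in movie_heroes.groupby("movie"):
    # Apply: reduce each piece to a single number.
    pieces[movie] = group["strength"].max()

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name="strength")

# The built-in aggregation does all three steps in one call.
builtin = movie_heroes.groupby("movie")["strength"].max()

print(manual)
```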

### Aggregating

In [None]:
movie_heroes.groupby('movie').size()

In [None]:
movie_heroes.groupby('movie').count()

In [None]:
movie_heroes.groupby('movie')['speed'].sum()

In [None]:
movie_heroes.groupby('movie').mean()

In [None]:
movie_heroes.groupby('movie').apply(
    lambda group: group['strength'] / group['strength'].max()
)

In [None]:
movie_heroes.groupby('movie').agg({
    'speed': 'mean',
    'strength': 'max',
})
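
When the result columns should get their own names, `agg` also accepts keyword arguments of the form `new_name=(column, function)`. A sketch assuming pandas 0.25 or newer; the output names `mean_speed` and `max_strength` are made up for this example. Note that `mean` skips the missing speed value:

```python
import pandas as pd

movie_heroes = pd.DataFrame({
    "hero": ["Batman", "Robin", "Spiderman", "Robocop", "Lex Luthor", "Dr Octopus"],
    "movie": ["Batman", "Batman", "Spiderman", "Robocop", "Spiderman", "Spiderman"],
    "strength": [400, 100, 400, 560, 89, 300],
    "speed": [100, 10, 200, 1, 20, None],
})

# Each keyword becomes an output column: name=(input column, aggregation).
summary = movie_heroes.groupby("movie").agg(
    mean_speed=("speed", "mean"),
    max_strength=("strength", "max"),
)
print(summary)
```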

In [None]:
movie_heroes = movie_heroes.reset_index()
movie_heroes

In [None]:
movie_heroes.groupby(['movie', 'hero']).mean()

## Index-related operations

### Data alignment on Index

In [None]:
movie_heroes

In [None]:
apetite = pd.DataFrame([
    dict(hero='Spiderman', is_hungry=True),
    dict(hero='Robocop', is_hungry=False)
])
apetite

In [None]:
movie_heroes['is_hungry'] = apetite['is_hungry']
movie_heroes

In [None]:
apetite.index = [2, 3]

In [None]:
movie_heroes['is_hungry'] = apetite['is_hungry']
movie_heroes
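
What happens in the cells above is that pandas matches values by index label, not by position. A compact sketch of the same rule on illustrative data, both for column assignment and for arithmetic:

```python
import pandas as pd

heroes = pd.DataFrame({"hero": ["Batman", "Spiderman", "Robocop"]})

# The new values carry index labels 1 and 2, so they land in rows 1 and 2.
# Row 0 has no matching label and is filled with NaN.
hungry = pd.Series([True, False], index=[1, 2])
heroes["is_hungry"] = hungry
print(heroes)

# The same rule applies to arithmetic: values are matched by label,
# and labels present on only one side give NaN.
left = pd.Series([400, 560], index=["Spiderman", "Robocop"])
right = pd.Series([1, 2], index=["Robocop", "Batman"])
combined = left + right
print(combined)
```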

### Setting index

In [None]:
indexed_movie_heroes = movie_heroes.set_index('hero')
indexed_movie_heroes

In [None]:
indexed_apetite = apetite.set_index('hero')
indexed_apetite

In [None]:
# and alignment works well automagically..

indexed_movie_heroes['is_hungry'] = indexed_apetite['is_hungry']

In [None]:
indexed_movie_heroes

### Merging two DFs (à la SQL join)

In [None]:
movie_heroes

In [None]:
apetite

In [None]:
# couple of other arguments available here

pd.merge(
    movie_heroes[['hero', 'speed']],
    apetite,
    on=['hero'],
    how='outer'
)
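
The `how` argument decides which keys survive the merge, much like the join type in SQL. A sketch comparing the four common values on small stand-in frames; the hero "Hulk" is added here only so that each side has a key missing from the other:

```python
import pandas as pd

speeds = pd.DataFrame({
    "hero": ["Batman", "Spiderman", "Robocop"],
    "speed": [100, 200, 1],
})
apetite = pd.DataFrame({
    "hero": ["Spiderman", "Robocop", "Hulk"],
    "is_hungry": [True, False, True],
})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(speeds, apetite, on="hero", how=how)
    row_counts[how] = len(merged)
    print(how, sorted(merged["hero"]))
```

`inner` keeps only heroes present in both frames, `left` and `right` keep every hero from one side, and `outer` keeps the union, filling the gaps with NaN.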

### DateTime operations

In [None]:
spiderman_meals = pd.DataFrame([
        dict(time='2016-10-15 10:00', calories=300),
        dict(time='2016-10-15 13:00', calories=900),
        dict(time='2016-10-15 15:00', calories=1200),
        dict(time='2016-10-15 21:00', calories=700),
        dict(time='2016-10-16 07:00', calories=1600),
        dict(time='2016-10-16 13:00', calories=600),
        dict(time='2016-10-16 16:00', calories=900),
        dict(time='2016-10-16 20:00', calories=500),
        dict(time='2016-10-16 21:00', calories=300),
        dict(time='2016-10-17 08:00', calories=900),
    ])
spiderman_meals

In [None]:
spiderman_meals.dtypes

In [None]:
spiderman_meals['time'] = pd.to_datetime(spiderman_meals['time'])
spiderman_meals.dtypes

In [None]:
spiderman_meals

In [None]:
spiderman_meals = spiderman_meals.set_index('time')
spiderman_meals

In [None]:
spiderman_meals.index

#### Filtering

In [None]:
# .loc is required for row selection by a date string in pandas 2.0 and newer
spiderman_meals.loc["2016-10-15"]

In [None]:
spiderman_meals["2016-10-16 10:00":]

In [None]:
spiderman_meals["2016-10-16 10:00":"2016-10-16 20:00"]

In [None]:
# .loc is required for row selection by a date string in pandas 2.0 and newer
spiderman_meals.loc["2016-10"]

#### Resampling (downsampling and upsampling)

In [None]:
spiderman_meals.resample('1D').sum()

In [None]:
spiderman_meals.resample('1H').mean()

In [None]:
spiderman_meals.resample('1H').ffill()

In [None]:
spiderman_meals.resample('1D').first()
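
The difference between the two directions: downsampling puts several rows into each bin, so an aggregation is needed, while upsampling creates bins with no data, so the gaps must be left empty or filled. A minimal sketch on a shortened, illustrative version of the meals data; the 12-hour step is passed as a `pd.Timedelta` to avoid depending on a particular frequency alias:

```python
import pandas as pd

meals = pd.DataFrame(
    {"calories": [300, 900, 1600]},
    index=pd.to_datetime([
        "2016-10-15 10:00",
        "2016-10-15 13:00",
        "2016-10-16 07:00",
    ]),
)

# Downsampling: two meals fall into the first daily bin, so they are summed.
daily = meals.resample("1D").sum()

# Upsampling the daily totals to 12-hour steps creates a bin with no data.
twelve_hours = pd.Timedelta(hours=12)
gaps = daily.resample(twelve_hours).asfreq()     # new rows stay empty (NaN)
filled = daily.resample(twelve_hours).ffill()    # new rows repeat the previous value

print(daily)
print(gaps)
print(filled)
```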

## Advanced topics

### Filling missing data

In [None]:
heroes_with_missing = pd.DataFrame([
        ('Batman', None, None),
        ('Robin', None, 100),
        ('Spiderman', 400, 90),
        ('Robocop', 500, 95),
        ('Terminator', 600, None)
    ], columns=['hero', 'strength', 'health'])
heroes_with_missing

In [None]:
heroes_with_missing.dropna(subset=['health'])

In [None]:
heroes_with_missing.fillna(0.054)

In [None]:
heroes_with_missing.fillna(heroes_with_missing.min())

In [None]:
# numeric_only=True skips the text column 'hero', which has no median
heroes_with_missing.fillna(heroes_with_missing.median(numeric_only=True))
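
Different columns often call for different fill strategies. `fillna` also accepts a dict mapping column names to replacement values; the sketch below fills `strength` with its median and `health` with a constant of 100, both chosen only for illustration:

```python
import pandas as pd

heroes_with_missing = pd.DataFrame(
    [
        ("Batman", None, None),
        ("Robin", None, 100),
        ("Spiderman", 400, 90),
        ("Robocop", 500, 95),
        ("Terminator", 600, None),
    ],
    columns=["hero", "strength", "health"],
)

# A dict passed to fillna maps column name -> replacement value.
filled = heroes_with_missing.fillna({
    "strength": heroes_with_missing["strength"].median(),
    "health": 100,
})
print(filled)
```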

### Scikit-learn

scikit-learn can handle both plain Python lists and numpy arrays.

In [None]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X=[
        [1],
        [2]
    ],
    y=[
        10,
        20
    ])

In [None]:
clf.predict([[3], [100], [1000]])

In [None]:
X = np.array([1, 2])[:,np.newaxis]
y = np.array([10, 20])

X

In [None]:
clf.fit(X, y)
clf.predict( np.array([3, 100, 1000])[:,np.newaxis] )

More models to try: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning