# Pandas Data Analysis Library

+ Author: Alexandre Manhães Savio
+ Date: 08/Jun/2015
+ Reference: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro

**pandas** is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [6]:
from IPython.core.display import HTML
HTML("<iframe src=http://pandas.pydata.org width=900 height=400></iframe>")

In [38]:
from IPython.core.display import HTML
HTML("<iframe src=http://www.scipy.org/ width=900 height=400></iframe>")

## How to install

In [39]:
!pip install pandas



## How to import

In [None]:
import pandas as pd

## Versions

In [1]:
import pandas as pd
from pandas import Series, DataFrame
pd.__version__

'0.16.1'

## Intro to data structures

### Series

+ [Series](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 


+ The axis labels are collectively referred to as the index. 


+ The basic method to create a Series is to call:

In [None]:
s = Series(data, index=index)

Here, `data` can be many different things:

+ a Python `dict`
+ an `ndarray`
+ a scalar value (like 5)

The passed `index` is a list of axis labels.

#### `np.ndarray`

If `data` is an **ndarray**, `index` must be the same length as `data`. If no index is passed, one is created with the values `[0, ..., len(data) - 1]`.

In [81]:
import numpy as np

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.230892
b    1.079832
c    0.000219
d    1.185617
e   -1.396949
dtype: float64

In [9]:
s = Series(np.random.randn(5))
s

0    0.408527
1    0.787688
2   -0.439678
3    0.668323
4   -2.138073
dtype: float64

#### `dict`

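If `data` is a **dict** and an `index` is passed, the values in `data` corresponding to the labels in the index are pulled out; labels that are missing from the dict get `NaN`. If no `index` is passed, one is built from the dict keys.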

In [31]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
Series(d)

a    0
b    1
c    2
dtype: float64

In [32]:
Series(d, index=['b', 'c', 'd', 'a'])

b     1
c     2
d   NaN
a     0
dtype: float64

#### Scalar value

If `data` is a scalar value, an `index` must be provided.

In [15]:
Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: float64

#### Series is `ndarray`-like

In [16]:
s[0]

0.44662078315873394

In [17]:
s[:3]

a    0.446621
b   -0.693499
c   -1.340619
dtype: float64

In [18]:
s[s > s.median()]

a    0.446621
d    1.207876
dtype: float64

In [19]:
np.exp(s)

a    1.563021
b    0.499824
c    0.261684
d    3.346368
e    0.471108
dtype: float64
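Positional access such as `s[0]` on a label-indexed Series is ambiguous, and newer pandas releases emit a deprecation warning for it. The following is a minimal sketch of the explicit alternatives. It uses a hypothetical Series `t` with fixed values (not the random `s` above) so that the results are reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with fixed values (not the random `s` used above)
t = pd.Series([0.5, -1.0, 2.0, 4.0, -3.0], index=['a', 'b', 'c', 'd', 'e'])

first = t.iloc[0]           # explicit positional access
head = t.iloc[:3]           # positional slice; the labels are kept
arr = t.values              # the underlying NumPy array, labels dropped
above = t[t > t.median()]   # boolean mask, as in the cell above
```

`.iloc` always means "by position" and `.loc` always means "by label", which avoids surprises when the index itself holds integers.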

#### Series is `dict`-like

In [20]:
s['a']

0.44662078315873394

In [21]:
s['e'] = 12.
s

a     0.446621
b    -0.693499
c    -1.340619
d     1.207876
e    12.000000
dtype: float64

In [23]:
s.get('f', 0)

0
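The dict analogy also covers membership tests and missing keys. This is a small sketch on a hypothetical Series `t`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical Series used only for this illustration
t = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

has_a = 'a' in t            # membership tests the labels, not the values
has_f = 'f' in t

try:
    t['f']                  # a missing label raises KeyError, like a dict
    missing_raises = False
except KeyError:
    missing_raises = True

fallback = t.get('f', 0)    # .get returns the default instead of raising
```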

#### Vectorized operations and label alignment with Series

In [24]:
s + s

a     0.893242
b    -1.386998
c    -2.681238
d     2.415751
e    24.000000
dtype: float64

In [25]:
s * 2

a     0.893242
b    -1.386998
c    -2.681238
d     2.415751
e    24.000000
dtype: float64

#### Okay. Where is the difference?

In [26]:
s[1:] + s[:-1]

a         NaN
b   -1.386998
c   -2.681238
d    2.415751
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the **union** of the indexes involved. Labels found in only one of the operands get `NaN` in the result.

Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research.
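Label alignment also applies when the two indexes are ordered differently or only partly overlap. A minimal sketch with two hypothetical Series, `x` and `y`, shows the default behaviour and two common ways of handling the resulting `NaN` values:

```python
import pandas as pd

# Hypothetical Series with partly overlapping, differently ordered labels
x = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
y = pd.Series([10.0, 20.0, 30.0], index=['c', 'b', 'd'])

total = x + y                      # aligned by label, not by position
filled = x.add(y, fill_value=0)    # treat a label missing on one side as 0
common = total.dropna()            # keep only labels present in both
```

Here `total` has the labels `a`, `b`, `c`, `d`, with `NaN` at `a` and `d`, because each of those labels appears in only one operand.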

#### One more thing...

In [36]:
s.name = 'my_lovely_series'
s

a     0.446621
b    -0.693499
c    -1.340619
d     1.207876
e    12.000000
Name: my_lovely_series, dtype: float64

### DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Like Series, it accepts many different kinds of input:

+ `dict` of 1-D `ndarrays`, `lists`, `dicts`, or Series
+ 2-D `numpy.ndarray`
+ Structured or record `ndarray`
+ A Series
+ Another DataFrame

Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. 

If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.

Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
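Three of the input types listed above (a 2-D `ndarray`, a Series, and another DataFrame) are not demonstrated in the cells that follow. This is a minimal sketch of each; all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: labels come from the optional index/columns arguments
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['one', 'two'])

# A Series: becomes a single column named after the Series
ser = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='one')
df_ser = pd.DataFrame(ser)

# Another DataFrame: index/columns select and reorder the existing labels
df_sub = pd.DataFrame(df_arr, index=['c', 'a'], columns=['two'])
```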

#### From `dict` of `ndarray`/`lists`

In [49]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}
DataFrame(d)

|   | one | two |
|---|-----|-----|
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |


In [52]:
DataFrame(d, index=['a', 'b', 'c', 'd'])

|   | one | two |
|---|-----|-----|
| a | 1 | 4 |
| b | 2 | 3 |
| c | 3 | 2 |
| d | 4 | 1 |


#### From `dict` of `Series`

In [54]:
d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = DataFrame(d)
df

|   | one | two |
|---|-----|-----|
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |
| d | NaN | 4 |


In [43]:
DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

|   | two | three |
|---|-----|-------|
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |


In [45]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [46]:
df.columns

Index(['one', 'two'], dtype='object')

#### From a `list` of dicts

In [55]:
data2 = [{'a': 1, 'b': 2}, 
         {'a': 5, 'b': 10, 'c': 20}]
DataFrame(data2)

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20.0 |


In [56]:
DataFrame(data2, index=['first', 'second'])

|   | a | b | c |
|---|---|---|---|
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20.0 |


In [57]:
DataFrame(data2, columns=['a', 'b'])

|   | a | b |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 5 | 10 |


#### From structured or record array

In [67]:
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [68]:
data

array([(0, 0.0, b''), (0, 0.0, b'')], 
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [75]:
data[:] = [(1,2.,'Hello'),(2,3.1,"World")]

In [76]:
DataFrame(data)

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2.0 | b'Hello' |
| 1 | 2 | 3.1 | b'World' |


In [77]:
DataFrame(data, index=['first', 'second'])

|   | A | B | C |
|---|---|---|---|
| first | 1 | 2.0 | b'Hello' |
| second | 2 | 3.1 | b'World' |


In [78]:
DataFrame(data, columns=['A', 'C', 'B'])

|   | A | C | B |
|---|---|---|---|
| 0 | 1 | b'Hello' | 2.0 |
| 1 | 2 | b'World' | 3.1 |


#### Other constructors

+ `DataFrame.from_dict`: takes a dict of dicts or a dict of array-like sequences.

+ `DataFrame.from_records`: takes a list of tuples or an ndarray with structured dtype. 

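+ `DataFrame.from_items`: takes a sequence of (key, value) pairs. It is available in the pandas version used here (0.16.1), but was deprecated in pandas 0.23 and removed in 1.0; on newer versions, `DataFrame.from_dict` covers the same use.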
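A minimal sketch of the first two constructors follows, reusing the kind of data shown in the earlier cells. The variable names are hypothetical, and `orient='index'` is only one of the options `from_dict` accepts.

```python
import pandas as pd

# from_dict with orient='index' treats the outer keys as row labels
by_row = pd.DataFrame.from_dict(
    {'first': {'a': 1, 'b': 2}, 'second': {'a': 5, 'b': 10}},
    orient='index',
)

# from_records builds one row per tuple; `columns` names the fields
records = [(1, 2.0, 'Hello'), (2, 3.1, 'World')]
by_record = pd.DataFrame.from_records(records, columns=['A', 'B', 'C'])
```

With the default `orient='columns'`, `from_dict` behaves like the plain `DataFrame(d)` constructor shown earlier, so the outer keys become column labels instead.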