# Pandas

[Pandas](http://pandas.pydata.org "Pandas Home Page") is a Python toolkit for manipulating, cleaning, and querying data. Pandas was created by Wes McKinney in 2008 and is an open source project under a very permissive license.

Two recommended books on Pandas:
* Python for Data Analysis
* Learning the Pandas Library

## The Series Data Structure

In [31]:
import pandas as pd

The **series** is one of the core data structures in pandas. A series is a cross between a dictionary and a list: items are stored in order, and there are labels with which you can retrieve them.  
An easy way to visualize this is as two columns of data. The first is the special index, a lot like dictionary keys. The second is your actual data.

You can view the documentation of the series object by placing a `?` after its name.

In [6]:
pd.Series?

The documentation tells us we can pass in some data, an index, and a name.  
The data can be anything that is array-like, such as a list.
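As a rough sketch of those three arguments, the cell below builds a series from a list and supplies an explicit index and a name. The labels `'a'`, `'b'`, `'c'` and the name `'animals'` are made up for illustration.

```python
import pandas as pd

# data: any array-like; index: one label per item; name: a label for the series
animals = pd.Series(['Tiger', 'Bear', 'Moose'],
                    index=['a', 'b', 'c'],
                    name='animals')

print(animals)
print(animals.index.tolist())  # the labels we supplied
print(animals.name)            # the name we supplied
```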

In [7]:
# create list of animals
animals = ['Tiger', 'Bear', 'Moose']
# cast list as a Series
pd.Series(animals)

0    Tiger
1     Bear
2    Moose
dtype: object

Here we see that pandas recognized the type of data held in the list. We passed in a list of strings, and pandas set the type to object.  

We don't have to use strings. If we pass in a list of whole numbers, for instance, pandas sets the type to int64.

In [9]:
numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

Pandas stores series values in a typed array using the NumPy library. This offers significant performance improvements over traditional Python lists.
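One way to see the typed array behind a series is the `to_numpy()` method. This is a small sketch of the idea rather than a benchmark.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# to_numpy() exposes the values as a NumPy array
values = numbers.to_numpy()

print(type(values))   # a numpy.ndarray, not a Python list
print(values.dtype)   # matches the dtype reported by the series
```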

Pandas will do type conversion for values such as `None` depending on the other values in the series.

Here a series of strings with a `None` keeps its object type as well as the `None` element.  
The numbers series gets the type float64, and the `None` is changed to `NaN`.

In [10]:
animals = ['Tiger', 'Bear', None]
pd.Series(animals)

0    Tiger
1     Bear
2     None
dtype: object

In [11]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

`NaN` is not equal to `None`.  
`NaN` is not equal to `NaN`.  
You need to use special functions to test for the presence of not a number, such as the NumPy library's `isnan()` function.

In [32]:
import numpy as np
np.nan == None

False

In [33]:
np.nan == np.nan

False

In [34]:
np.isnan(np.nan)

True
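Pandas also ships its own missing-value helpers. The sketch below, using the same `[1, 2, None]` list as above, shows `isna()` on a series and the top-level `pd.isna()`, which treats `None` and `NaN` alike.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# element-wise test for missing values
missing = numbers.isna()
print(missing)

# pd.isna works on scalars too, and treats None and NaN alike
print(pd.isna(None), pd.isna(np.nan))

# count() reports the number of non-missing entries
print(numbers.count())
```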

If you make a series object from a dictionary, the keys become the index, rather than integer positions.

In [35]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [36]:
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')
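A dictionary can also be combined with an explicit `index` argument. In the sketch below, the label `'Hockey'` is a made-up entry that is not in the dictionary; pandas keeps only the labels listed in `index` and fills the missing one with `NaN`.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}

# only the labels listed in index are kept; 'Hockey' has no value in the dict
subset = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
print(subset)
```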

## Querying a Series

A pandas series can be queried either by index position or by index label.  
**To query by numeric location, starting at zero, use the `iloc` attribute.**  
_`iloc` ~ index location_  
**To query by index label, use the `loc` attribute.**  
_`loc` ~ label location_

In [37]:
# create a dict mapping national sports to their countries
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [38]:
# to see the 4th country in the list we can use the iloc attribute with 3
s.iloc[3]

'South Korea'

In [39]:
# to find which country has golf as its national sport we use the loc attribute
s.loc['Golf']

'Scotland'

**Note:** `iloc` and `loc` are not methods, they are attributes. So you don't use parentheses to query them, but square brackets instead, which we'll call the **indexing operator**.
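Both attributes also accept slices and lists inside the square brackets. The following sketch reuses the `sports` dictionary from above; note that `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)

first_two = s.iloc[0:2]                   # positions 0 and 1
golf_to_sumo = s.loc['Golf':'Sumo']       # labels Golf through Sumo, inclusive
picked = s.loc[['Archery', 'Taekwondo']]  # an explicit list of labels

print(first_two)
print(golf_to_sumo)
print(picked)
```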

A pandas Series also allows a shorthand syntax: using the indexing operator directly on the series itself.  
If you pass an integer, the operator behaves as if you want to query via the `iloc` attribute. If you pass an object, it queries as if you wanted to use the label-based `loc` attribute.  

In [40]:
s[3]

'South Korea'

In [41]:
s['Golf']

'Scotland'

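This becomes dangerous when the index labels are themselves integers, because the smart syntax then treats the integer as a label rather than a position. Recent pandas releases also warn when an integer is used positionally through the plain indexing operator. So it is usually best to state explicitly which type of indexing you are using with the appropriate attribute.  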

For example, the dictionary below uses integer labels.

In [42]:
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)

In [43]:
s[0]

KeyError: 0
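With the explicit attributes the ambiguity disappears. A short sketch using the same integer-labeled series:

```python
import pandas as pd

sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)

by_position = s.iloc[0]  # first item, regardless of its label
by_label = s.loc[99]     # item whose label is 99
print(by_position, by_label)

# the plain indexing operator looks for the label 0, which does not exist
try:
    s[0]
except KeyError as err:
    print('KeyError:', err)
```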

Often we want to apply a common operation across all elements in a series.

In [44]:
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

We could iterate over each item in the series.

In [45]:
total = 0
for item in s:
    total+=item
print(total)

324.0


Pandas and the underlying NumPy library support a method of computation called **vectorization**.  
Vectorization works with most of the functions in the NumPy library, including the `sum` function.

In [46]:
import numpy as np

total = np.sum(s)
print(total)

324.0
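The series object also has its own `sum()` method, which is vectorized in the same way. A quick sketch checking that all three approaches agree on the series above:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.00, 120.00, 101.00, 3.00])

# explicit Python loop
loop_total = 0
for item in s:
    loop_total += item

# NumPy function and the equivalent Series method
numpy_total = np.sum(s)
method_total = s.sum()

print(loop_total, numpy_total, method_total)
```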


Now we want to determine which of these is faster.  
Jupyter Notebook has a magic function, `%%timeit`, which times how long it takes to execute a cell.  
**Note:** All magic functions begin with a percentage sign.
You can press tab after `%` to see which magic functions are available. Cell magic functions begin with two percentage signs.

In [60]:
# create a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head()

0    806
1    314
2     34
3    802
4    792
dtype: int64

In [56]:
len(s)

10000

In [57]:
%%timeit -n 100
summary = 0
for item in s:
    summary+=item

100 loops, best of 3: 1.21 ms per loop


In [64]:
%%timeit -n 100
summary = np.sum(s)

100 loops, best of 3: 41.2 µs per loop


As we can see, the vectorized function performs much faster.
This is why it is important for data scientists to be aware of parallel computing features and to start thinking in functional programming terms.
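Outside of a notebook, the standard library's `timeit` module gives a similar comparison. This is only a sketch: the helper name `loop_sum`, the seed, and the repeat count are arbitrary choices, and absolute timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1000, 10000))

def loop_sum(series):
    """Sum a series one element at a time (hypothetical helper)."""
    total = 0
    for item in series:
        total += item
    return total

# time 10 runs of each approach
loop_seconds = timeit.timeit(lambda: loop_sum(s), number=10)
vector_seconds = timeit.timeit(lambda: np.sum(s), number=10)

print(f'loop:       {loop_seconds / 10:.6f} s per run')
print(f'vectorized: {vector_seconds / 10:.6f} s per run')
```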

Let's try timing how long it takes to add 2 to each element:  
once using vectorization,  
once using a loop.

We will run each calculation 10 times.

In [65]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s += 2  # vectorized: adds 2 to every element at once


In [67]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():  # iteritems() was removed in pandas 2.0
    s.loc[label] = value + 2

10 loops, best of 3: 816 ms per loop


It is quite clear again that vectorization wins.
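Speed aside, both approaches compute the same thing. The sketch below checks this on a smaller series (1,000 items, an arbitrary size chosen to keep the loop quick); it uses `items()`, the label/value iterator available in current pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 1000))

# vectorized: one expression adds 2 to every element
vectorized = s + 2

# loop: update a copy one label at a time
looped = s.copy()
for label, value in s.items():
    looped.loc[label] = value + 2

# compare the two results element by element
print((vectorized == looped).all())
```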