# Pandas NumPy Intro

### Objectives
After this lesson you should be able to...
+ Get help by knowing your object, reading documentation and using inline commands
+ Know why pandas is more suitable for data analysis than Python lists
+ Identify a Series as a single dimensional data structure with an **index** and **values**
+ Know the difference between an **index** and **values**
+ Know the difference between attributes and methods and how they are accessed
+ Create a Series with the constructor using **index** and **values** keyword arguments
+ Access Series items by integer position with **`.iloc`**
+ Access Series items by index label with **`.loc`**
+ Always use **`.iloc`** or **`.loc`** for accessing Series elements
+ Know why using the brackets **`[]`** to access elements is undesired
+ Know that the indexes automatically align when two Series objects are added together (or combined in any other operation)
+ Be familiar with basic Series methods and attributes

### Prepare for this lesson by...
[ALWAYS READ THE DOCUMENTATION BEFORE A LESSON!](http://pandas.pydata.org/pandas-docs/stable/)
+ Read the [Package Overview](http://pandas.pydata.org/pandas-docs/stable/overview.html)
+ Read [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html) - **just the Series section**
+ Read [Indexing and Selecting](http://pandas.pydata.org/pandas-docs/stable/indexing.html) - **up to but not including Selection By Callable**

# Welcome to ....
![](https://dataaspirant.files.wordpress.com/2014/10/pandas.png)

### What is Pandas?
Pandas is possibly the best open source data exploration library currently available. It gives the user tremendous power to easily explore, manipulate, query, aggregate, visualize, `<insert cool sounding data word>`, etc... tabular (row, column) data.

### Why Pandas and not xyz?
In this current age of data explosion, there are now many dozens of other tools that can do much of what the pandas library can do, if not more. However, there are many aspects of pandas that set it apart and it continues to have one of the fastest growing user bases.
1. It's a Python library, which makes it easy to read, easy to develop, and easily integrates with other popular Python libraries like matplotlib, numpy, and scikit-learn.
2. It is nearly self-contained in that tremendous functionality is built in one package. This contrasts with R, where many packages are needed to obtain the same functionality.
3. The community is amazing. Looking at stackoverflow, for example, there are [nearly 23,000](http://stackoverflow.com/questions/tagged/pandas) pandas questions. SAS, a multi-billion dollar revenue analytics software maker has only 6k questions. This is one huge benefit of open source in general. If you need help, you are nearly guaranteed to find it very quickly. After a while most of your questions will be answered in the top 3 search results from Google.
4. Lightning fast development. New features are added all the time thanks to the huge community. This contrasts with proprietary software which can never move as fast.
5. Powerful, simple, amazing community!!!

### Why is it named after an east Asian bear?
Pandas was built by a young guy named Wes McKinney beginning in 2008 at a hedge fund named AQR. In finance speak, tabular data is called 'panel data', which smashed together becomes pandas. If you are really interested in the history, you can hear it from the creator [himself](https://www.youtube.com/watch?v=kHdkFyGCxiY).

### Python already has data structures to handle data, why do we need another one?
Even though Python itself is a high level language, its primary built-in data structures - lists and dicts - do not easily lend themselves to tabular data in ways that humans can easily visualize, nor to vectorized (no for loops) operations. Just summing up items in a list can be quite slow.

### NumPy
NumPy (think of 'numeric' and 'python' smashed together) is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others. NumPy's primary data structure is an n-dimensional array which allows for very fast computation needed in scientific computing. See the example below showcasing the speed difference between summing a list vs a numpy array of 1 million numbers.

In [1]:
# create a list of 1 million
n = 1000000
my_list = list(range(n))

### Timing Code Execution with %%timeit
`%%timeit` is a cell magic command that executes all the code in the code cell and reports how long it took to run.

In [2]:
%%timeit
sum1 = sum(my_list)

100 loops, best of 3: 8.67 ms per loop


In [3]:
# if you've downloaded anaconda then you already have numpy
import numpy as np

In [4]:
# create array with arange function.
array = np.arange(n)

In [5]:
%%timeit 
sum2 = np.sum(array)

1000 loops, best of 3: 447 µs per loop


### What just happened?
IPython comes with handy dandy magic commands that give you some great extra functionality. The one I use the most is timeit, which times how long an operation takes. Precede it with % for a single-line magic and %% for a whole-cell magic. Using the built-in sum function with a list took approximately 20 times longer than using numpy's array, and this was just a simple sum of a list of numbers. This difference increases with the complexity of the operation performed on the data.
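Outside IPython the `%timeit` magic is not available. A rough equivalent can be sketched with the standard library `timeit` module. The repeat count of 10 is an arbitrary choice made for this illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000
my_list = list(range(n))
array = np.arange(n)

# timeit.timeit returns the total number of seconds for all `number` runs
list_seconds = timeit.timeit(lambda: sum(my_list), number=10)
array_seconds = timeit.timeit(lambda: np.sum(array), number=10)

# divide by the number of runs to get an average time per run
print(f"list:  {list_seconds / 10 * 1000:.3f} ms per loop")
print(f"array: {array_seconds / 10 * 1000:.3f} ms per loop")
```

Unlike `%timeit`, this sketch does not pick the number of loops automatically or report the best of several repeats.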

### Why is numpy so fast?
Numpy ndarray operations are executed in precompiled C code which makes for much faster execution times. A python list in contrast must be iterated through at run-time, can take any number of different types and so is not well suited to do large numerical computations. 

### Why not numpy?
Though numpy is fast and can handle most of our data needs, it still is relatively low-level. For example, the ndarray is just a brick of numbers. The main building block in pandas, the NDFrame, is built directly upon numpy's building block (the ndarray). Pandas allows much easier access to rows and columns, powerful statistical functionality, enhanced merging and grouping and many more data manipulation abilities. We will not delve into the specifics of numpy, but it is useful to remember that the pandas building blocks have numpy building blocks as their base. More info on numpy can be found [in the docs](https://docs.scipy.org/doc/numpy/user/index.html). We actually will be using some numpy functionality but it should be self explanatory when it comes up.
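As a small sketch of that relationship, the cell below wraps an ndarray in a Series. The variable name `temps` and the city labels are made up for illustration only.

```python
import numpy as np
import pandas as pd

# a plain ndarray: elements can only be reached by integer position
arr = np.array([20.5, 31.0, 25.2])
print(arr[1])

# the same numbers wrapped in a Series gain labels (the index)
temps = pd.Series(arr, index=['Oslo', 'Cairo', 'Lima'])
print(temps.loc['Cairo'])

# the underlying values are still a NumPy ndarray
print(type(temps.values))
```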

### Quick Aside on Magic Commands
For more advanced usage of IPython like the %timeit magic command used above, [visit the IPython magic docs](http://ipython.readthedocs.io/en/stable/interactive/magics.html). The **%lsmagic** command below lists all of the available magic commands.

In [6]:
# Want to view all the magical abilities?
# view all the magic commands
%lsmagic

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3

### Pandas is simple: There's really only one type of data structure
Data representation is simple. It's just plopped in what most people would call a table. Rows, columns, that's it.

![Boring Table](http://www.homeandlearn.co.uk/powerpoint/images/charts/8Table9.gif) 

You've seen this every day of your life. So no more explaining. 

Well, not exactly. There are numerous formats for data (XML, json, raw bytes, etc...), but for our purposes, we will only be examining what everyone thinks of when they think of data - a table.

Pandas is built just for analyzing this tabular, rectangular, very deceptively normal concept of data. There are two primary objects that account for everything we will be covering. 

**The Series and the DataFrame.**

The **Series** is a very simple pandas object. It is a single column of data with an **index** that references each element of the Series. The index is **very** important in pandas and is what separates a Series from data in numpy. Much more attention will be given to the index later.

The **DataFrame** is a collection of Series (columns) and forms your normal concept of a table with rows and columns. Again, the **index** is very important. Both the rows and columns have an **index** that references them.

For now, think of the **index** as a name that can reference a particular row or column of data.
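A minimal sketch of how the two objects relate is shown here. The column names, row labels and numbers are hypothetical, and DataFrames are covered properly later.

```python
import pandas as pd

# hypothetical data, used only for illustration
df = pd.DataFrame(
    {'height': [1.7, 1.8, 1.6], 'age': [31, 45, 27]},
    index=['ann', 'bob', 'cy'],
)

# selecting one column returns a Series
col = df['age']
print(type(col))

# the rows and the columns each have their own index
print(df.index)
print(df.columns)

# the Series shares the row index of the DataFrame it came from
print(col.index.equals(df.index))
```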

## Series: A manual build
We will start our first lines of pandas by constructing a Series, a single dimensional array with an index.

In [5]:
# import pandas and alias it to pd as is convention
import pandas as pd

In [8]:
# to construct a Series, simply pass it a python list
s = pd.Series([1,3,3,77])
s

0     1
1     3
2     3
3    77
dtype: int64

### Examining Output of the Series
A 4 item Series was created and stored in the variable **`s`**. Several more things were printed to the screen. To the left of the values is the index, beginning at 0. This is the default index when none is given during the construction of the Series object. Also printed out is the data type (dtype) of the Series, `int64`, which stands for a 64 bit integer.

### Attributes and Methods of a Series
Remembering that every single variable in Python is an object and that all objects have attributes and methods, it's interesting to see everything that a Series is capable of doing. The **dir** function lists all the attributes and methods of any Python object.

In [9]:
# As was done earlier in the precourse, print out all the methods/attributes of the series object
print(dir(s))

['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_prepare__', '__array_priority__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__int__', '__invert__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__le__', '__len__', '__long__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '

### Help?
That is an unbelievable number of attributes and methods. It's not likely that you will have them all memorized, so be prepared to get help when using pandas, especially in the beginning. Luckily for us, help is quite easy to get. Pandas has excellent [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html) with plenty of examples that you should go through completely if you want to know just about every single feature the library has to offer. That link always points to the newest (stable) pandas version.

To get help inside the notebook use the help function to have the help printed to the screen or put a '?' at the end of the method to bring up a separate help window on the bottom. To view source code put '??' at the end. You can also directly view the source code in your file system by navigating to /path/to/anaconda/lib/python3.5/site-packages/pandas/core/base.py

And one more excellent way to get help and my personal favorite - is to press shift + tab + tab once inside the parentheses of a method.

In [10]:
## Get source code for value_counts method
s.value_counts??

### Counting the data
Counting data is one of the most common things to do with it. The **`value_counts`** method counts the occurrences of each distinct value in a Series and orders the result from most to least frequent.

In [11]:
# count and order the occurrences of each datum
s.value_counts()

3     2
77    1
1     1
dtype: int64

### Value Counts Explanation
The output looks very straightforward. The value 3 occurred twice with 77 and 1 each occurring once. End of story. Well, not really. Do you know what type of object was returned from this operation? Take a guess.

Let's store the result to a variable and see the resulting type.

In [12]:
# examine what is returned from value_counts
s_vc = s.value_counts()

type(s_vc)

pandas.core.series.Series

### Knowing Your Results
Even though the output returned from **`value_counts`** may have been obvious for anyone to interpret, it is crucial to know exactly what was returned. A new **Series** object is created that has values of 2, 1, 1. The **index** of this series has values 3, 77, 1.
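To make this concrete, the sketch below pulls the returned Series apart. The counts are looked up by label with `.loc` so that the example does not depend on how tied counts happen to be ordered.

```python
import pandas as pd

s = pd.Series([1, 3, 3, 77])
s_vc = s.value_counts()

# the distinct data become the index, the counts become the values
print(s_vc.index)
print(s_vc.values)

# look up a count by label, like any other Series
print(s_vc.loc[3])    # 3 appears twice
print(s_vc.loc[77])   # 77 appears once
```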

# Always Know What Type of Object is Returned!
It's imperative to always know the type of object you are working with. To be certain, use the **`type`** function to find out.

Every value accessed through an object's dot notation is itself some kind of object, and again it's extremely important to know what type of object is returned. Every element of a list, tuple, set, dictionary, pandas Series or DataFrame is an object itself, and it is important to know what that object's type is.

It should be obvious that knowing an object's type is the first thing you should know about it, but this can easily be overlooked when working with more complex Python libraries like pandas.

Similarly to how you must know if you are looking at a bird or cat or car or human, you need to know what type of object you have so that you can know what to do with it and what it's capable of doing.

There will be many examples shown below dealing with understanding object types.

In [13]:
# simple types. boolean, integer, float, string
# use the type function to output the types
a = True
b = 5
c = 9.45
d = 'asdf'

type(a), type(b), type(c), type(d)

(bool, int, float, str)

In [14]:
# You can create complex (imaginary) numbers by using the letter j
a = 9 + 5j

type(a)

complex

In [2]:
# None is an object with no public attributes or methods
a = None
type(a)

NoneType

In [16]:
# not all keywords are objects. Only things that are able to be assigned to a variable are objects
type(for)

SyntaxError: invalid syntax (<ipython-input-16-019e0d4fffc8>, line 2)

In [17]:
# not all keywords are objects
type(in)

SyntaxError: invalid syntax (<ipython-input-17-e71e8c95fe74>, line 2)

In [18]:
# are functions objects? Yes!
a = max

type(a)

builtin_function_or_method

In [19]:
# lists, sets, tuples, dictionaries
a = [1, 2, 3]
b = {1, 5}
c = (6, 9, True)
d = {'a': 1, 54: 'adsf'}

type(a), type(b), type(c), type(d)

(list, set, tuple, dict)

In [20]:
# get the attributes/methods of a dictionary. What is the type returned?
d = {'a': 1, 54: 'adsf'}
dictionary_abilities = dir(d)

type(dictionary_abilities)

list

In [3]:
# how about range
a = range(10)

type(a)

range


In [22]:
# How about Python modules? Yes, they are their own types
import math

type(math)

module

### Most Common Python Built-in Types
This covers defining the most common Python built-in types: boolean, integer, float, complex, string, list, set, tuple, dictionary, range, modules and None.

Take a look at the [built-in types section](https://docs.python.org/3/library/stdtypes.html) in the Python documentation for a much more detailed breakdown of the different types.

Continuing with a few more examples

In [23]:
# dictionary with values of many different types
d = {11: 'prime', 'composites':[4, 6, 8], 'a': True}

print(d['composites'])

type(d['composites']) # the value mapped to the 'composites' key is a list

[4, 6, 8]


list

In [24]:
# lists of objects
a = [max, 1, [4, 5, min], {'a': 1}]

print(type(a))

type(a[2]), type(a[3]) # lists and dictionaries in a list

<class 'list'>


(list, dict)

### Examining the Series: The Index and the Values
A pandas Series is composed of the index and the values. Both the index and the values are themselves separate objects and have their own types. The index is a pandas object and its type is Index. The values are a NumPy object with type ndarray. You can access these objects directly from a Series with the **index** and **values** attributes using dot notation.

In [6]:
# recreate the Series then access its index and values
s = pd.Series([1,3,3,77])

In [7]:
# get the index
s_index = s.index

s_values = s.values

# output the object types
type(s_index), type(s_values)

(pandas.indexes.range.RangeIndex, numpy.ndarray)

Technically, in this particular case the index is another pandas object called the **RangeIndex** which functions similarly to the **range** built-in function but that distinction is not important for now. The important point is that the index is a pandas object and the values are a NumPy object.

In [27]:
# lets look at the values of the index. Use dot notation again
s_index.values

array([0, 1, 2, 3])

In [28]:
# can use dot notation twice starting from the original Series to get the values of the index
s.index.values

array([0, 1, 2, 3])

### Technical Note on Series Construction
The command **`pd.Series([1, 3, 3, 77])`** appears as though a function named **Series** from the pandas module is being invoked. All previous code in these notebooks that ended in parentheses was calling a function. Technically, **Series** is not a function but a class. The command creates an **instance** of the **Series** class by passing a 4 element list to the Series constructor.

For the practical purposes of this notebook, there is no difference between a class constructor and a function that returns an object. For example, **`np.arange(100)`** is technically a function that returns a numpy ndarray, and **`pd.Series([1, 3, 3, 77])`** calls the Series constructor and creates a pandas Series object.

Very typically, class names are capitalized, like **`Series`**, and function names are not, like **`arange`**.

In [29]:
type(np.arange), type(pd.Series)

(builtin_function_or_method, type)
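The same distinction can be checked with `isinstance`. This is only a sketch of one way to tell a class from a function and is not something you need in day-to-day pandas work.

```python
import numpy as np
import pandas as pd

# pd.Series is a class, so calling it creates an instance of that class
s = pd.Series([1, 3, 3, 77])
print(isinstance(pd.Series, type))   # True: Series is a class
print(isinstance(s, pd.Series))      # True: s is an instance of it

# np.arange is a function that happens to return an ndarray
a = np.arange(4)
print(isinstance(np.arange, type))   # False: arange is not a class
print(isinstance(a, np.ndarray))     # True
```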

### Construct a series with a given index
A Series can be constructed with a specific user-defined index. The arguments **data** and **index** can be used to explicitly define each.

In [30]:
# construct series by explicitly defining the values and the index
s = pd.Series(data=[1,3,3,77], index=['a', 'b', 'c', 'd'])
s

a     1
b     3
c     3
d    77
dtype: int64

In [31]:
# create a series with an index of different data types
# almost never done in practice but shows what is possible
s = pd.Series(data=[1,3,3,77], index=[{'a', 'b'}, ('b','c'), range(10), 'd'])
s

{a, b}                             1
(b, c)                             3
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)     3
d                                 77
dtype: int64

In [32]:
s.index

Index([{'a', 'b'}, ('b', 'c'), (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), 'd'], dtype='object')

### Simple Mathematical Operations on Series
It's very easy to perform simple math operations to a series.

In [33]:
# construct a Series
s = pd.Series(data=[1,3,3,77], index=['a', 'b', 'c', 'd'])

In [34]:
# add 5 to each element
s + 5

a     6
b     8
c     8
d    82
dtype: int64

In [35]:
# divide each element by 5
s / 5

a     0.2
b     0.6
c     0.6
d    15.4
dtype: float64

### Lists cannot do this
If you have only worked with python lists and dictionaries for containing data then the last code blocks should amaze you. Although it's obvious what has happened, this simple functionality does not come out of the box with a python list.

In [36]:
# the plus symbol is used to concatenate two lists together
my_list = list(range(4))
my_list + 5

TypeError: can only concatenate list (not "int") to list

Since 5 is an int and not a list, an error is thrown. Python has no idea that you would like to add 5 to each of the elements. Pandas natively understands that you would like to actually add 5 to every element. Concatenating to a Series is a bit more troublesome than concatenating to a python list (more on this later).
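A short side-by-side sketch of the difference: `+` between two lists concatenates, while `+` on a Series works element by element.

```python
import pandas as pd

my_list = [0, 1, 2, 3]

# + between two lists concatenates them
concatenated = my_list + [5]
print(concatenated)                 # [0, 1, 2, 3, 5]

# element-wise addition on a list needs an explicit comprehension
added_list = [x + 5 for x in my_list]
print(added_list)                   # [5, 6, 7, 8]

# a Series does the element-wise addition natively
s = pd.Series(my_list)
added_series = s + 5
print(added_series.tolist())        # [5, 6, 7, 8]
```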

### Vectorized Operations
Pandas/numpy are filled with vectorized operations going on all the time. A vectorized operation is one where a sequence of numbers is operated on without the explicit writing of for loops. They are handled outside of python in precompiled C/Fortran code that has been optimized a long time ago. Vectorized operations allow you to execute many operations with ease that normally take a very long time to write through normal iterative methods in python.
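As an illustration, the sketch below computes the same squares twice, once with an explicit Python loop and once with a single vectorized expression. Both give the same numbers; only the second hands the looping to compiled code.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 6))

# explicit loop: square each value one at a time in Python
squares_loop = []
for value in s:
    squares_loop.append(int(value) ** 2)

# vectorized: one expression, with the looping done inside NumPy
squares_vec = s ** 2

print(squares_loop)
print(squares_vec.tolist())
```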

# <del>LOOPS</del>
Because of vectorization, we can say goodbye to loops. If you are writing loops in pandas, you are probably doing it wrong.

In [37]:
# A slew of other mathematical operations can be performed on a pandas Series
# raise every element to a power and continue doing element by element math
s ** 4 / 13 - 40

a   -3.992308e+01
b   -3.376923e+01
c   -3.376923e+01
d    2.704040e+06
dtype: float64

In [38]:
# doing the same arithmetic with a list requires a list comprehension
# note: this list ends in 37 (not 77), so the last result differs from the Series above
my_list = [1, 3, 3, 37]
[element ** 4 / 13 - 40 for element in my_list]

[-39.92307692307692,
 -33.76923076923077,
 -33.76923076923077,
 144126.23076923078]

### Series Methods Exploration
There are dozens of Series methods that allow tremendous power. Some very basic ones are covered below. Remember that methods always follow the dot notation from your object. Press **tab** after the dot to see the entire list of available methods in a menu and **shift + tab + tab** to see the help menu.

In [39]:
s.add(5) # add 5 using a method

a     6
b     8
c     8
d    82
dtype: int64

In [40]:
s.sum() #sum up all the items

84

In [41]:
# sort the values from least to greatest
s.sort_values()

a     1
b     3
c     3
d    77
dtype: int64

In [42]:
# many methods have very valuable arguments that can be set to get a different result. 
# Here we sort from largest to smallest
s.sort_values(ascending=False)

d    77
c     3
b     3
a     1
dtype: int64

In [43]:
# more aggregations
# these all return a single number, except mode, which returns a Series
s.std(), s.var(), s.min(), s.max(), s.mean(), s.median(), s.mode() 

(37.345236197762446, 1394.6666666666667, 1, 77, 21.0, 3.0, 0    3
 dtype: int64)

In [44]:
# execute operations that return the entire data set
# keep track of sum as you iterate down the column
s.cumsum()

a     1
b     4
c     7
d    84
dtype: int64

In [45]:
# keep track of product as you iterate down the column
s.cumprod()

a      1
b      3
c      9
d    693
dtype: int64

### Head and Tail Methods
Series objects can occasionally be extremely large and not a good choice to print on the screen. The **`head`** and **`tail`** methods allow you to quickly inspect the first or last elements of a Series.

In [46]:
# create a large Series
s = pd.Series(np.arange(500))

In [47]:
# Use head method - defaults to first 5
s.head(5)

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [48]:
# can pass a parameter to get more/fewer elements
s.head(9)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int64

In [49]:
# tail gets the last n elements
s.tail(3)

497    497
498    498
499    499
dtype: int64

### Accessing individual elements of a Series
Grabbing elements in a Series uses the brackets [ ] operator similarly to lists. Below are some examples of accessing elements in a pandas Series.

In [50]:
# define a series
s = pd.Series(data=[1,3,3,77, 100, -9], index=['a', 'b', 'c', 'd', 'e', 'f'])

s

a      1
b      3
c      3
d     77
e    100
f     -9
dtype: int64

In [51]:
# get the first element
# series are zero indexed
s[0]

1

In [52]:
# get the last element
s[-1]

-9

In [53]:
# use a list of integers to get several elements at once
# returns a Series and not just a single element
s[[0, 3, 5]]

a     1
d    77
f    -9
dtype: int64

In [54]:
# using a slice
s[3:]

d     77
e    100
f     -9
dtype: int64

In [55]:
# slice with a step size
s[2:6:2]

c      3
e    100
dtype: int64

### Massive confusion about to begin
So you are thinking, getting elements and slices from a Series is quite trivial: Just use the same notation as with lists. This does work when the index of your Series is all characters or integers starting at 0.

When the Series index has a mix of integers and characters or has integers that do not start with 0 or non-contiguous integers then there will be problems.

See the confusing examples below

In [56]:
# construct a Series with index beginning at 2
s = pd.Series(data=[1,3,3,77, 100, -9], index=range(2,8))
s

2      1
3      3
4      3
5     77
6    100
7     -9
dtype: int64

In [57]:
# attempt to access the first element of the Series
s[0]

KeyError: 0

### Integer Location vs Label Location
It turns out that accessing Series elements is not exactly like accessing elements in a list. Lists always use the **integer location** of the element. The **Series** allows you to use the **label location** as well as the **integer location**.

More explicitly, **integer location** is the 0-indexed integer location of the element in the Series and works the exact same as with lists.

**Label Location** uses the label of the **index** to locate the desired element. A Python dictionary works the same way by looking up a **key** to find a value. This is why the example above raised a **KeyError**. The Series attempted to find the label **0** and it was not there.
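The dictionary analogy can be sketched with plain Python. The names `values`, `by_label` and `errors` below are made up for this illustration.

```python
# a list is looked up by integer position, a dict by key
values = [1, 3, 3]
by_label = {2: 1, 3: 3, 4: 3}

print(values[0])      # position 0 exists
print(by_label[2])    # key 2 exists

errors = []

try:
    by_label[0]       # there is no key 0
except KeyError as err:
    errors.append(type(err).__name__)

try:
    values[9]         # there is no position 9
except IndexError as err:
    errors.append(type(err).__name__)

print(errors)
```

A missing label fails like a missing dictionary key, while a missing position fails like an out-of-range list index.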

### .iloc[ ] vs .loc[ ] vs [ ]
Pandas has some esoteric rules whether **label location** or **integer location** will be used when applying the brackets operator. Because of this ambiguity I highly recommend using either **.iloc** or **.loc** to select elements in a Series. 

**.iloc** - stands for **integer location** and will always use integer location to select elements.

**.loc** - always used for **label location** to select elements

In [58]:
s = pd.Series(data=[1,3,3,77, 100, -9], index=['a', 'b', 'c', 'd', 'e', 'f'])

In [59]:
# use .iloc to specify only integer location
s.iloc[4]

100

In [60]:
s.iloc[2]

3

In [61]:
# try a location that does not exist
# yields IndexError and not KeyError
s.iloc[9]

IndexError: single positional indexer is out-of-bounds

In [62]:
# get specific integer locations
s.iloc[[2, 5, 1]]

c    3
f   -9
b    3
dtype: int64

In [63]:
# slice
s.iloc[1:5]

b      3
c      3
d     77
e    100
dtype: int64

In [64]:
# slice with step
s.iloc[::2]

a      1
c      3
e    100
dtype: int64

In [65]:
# Now for label location. Use the name of the label to reference each selection
s.loc['a']

1

In [66]:
s.loc['d']

77

In [67]:
# try and use an integer
# It's looking for a character since the index is all characters
s.loc[0]

TypeError: cannot do label indexing on <class 'pandas.indexes.base.Index'> with these indexers [0] of <class 'int'>

In [68]:
# use a list to get several specific elements
s.loc[['f', 'b']]

f   -9
b    3
dtype: int64

In [69]:
# can slice with labels
s.loc['c':]

c      3
d     77
e    100
f     -9
dtype: int64

In [70]:
s.loc['c':'f']

c      3
d     77
e    100
f     -9
dtype: int64

In [71]:
s.loc['c':'f':2]

c      3
e    100
dtype: int64

In [72]:
# KeyError and not TypeError since 'h' is a character
s.loc['h']

KeyError: 'the label [h] is not in the [index]'
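The examples above all use a character index. The difference between the two indexers matters most when the labels are themselves integers, as in the earlier Series whose index began at 2. A minimal sketch:

```python
import pandas as pd

# integer labels that do not start at 0
s = pd.Series(data=[1, 3, 3, 77, 100, -9], index=range(2, 8))

print(s.iloc[0])   # first element by position
print(s.loc[2])    # element whose label is 2 (the same element)

# position 2 and label 2 are different elements
print(s.iloc[2])   # third element by position, which carries the label 4
print(s.loc[4])    # element whose label is 4 (the same element)

# label slices include the end label, position slices exclude the end position
by_label = s.loc[2:4].tolist()      # labels 2, 3, 4
by_position = s.iloc[2:4].tolist()  # positions 2, 3
print(by_label)
print(by_position)
```

Note that a `.loc` slice includes its end label, which is also why `s.loc['c':'f']` above returned the element labeled `'f'`.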

### Deprecation of `.ix`
In the early days of pandas, **`.ix`** was a popular way to access elements as it could take both positional and label arguments. But, this is confusing as it is ambiguous. Thankfully, **`.ix`** has been deprecated going into pandas version 0.20 so it should become less used.

Unfortunately, many historical (and new) questions on stackoverflow use **`.ix`**. Do not let this fool you. Use **`.loc`** and **`.iloc`** instead.

# Why all the fuss over this index
Indexes in Series and DataFrames play a huge (and perhaps surprising) role in pandas. Let's begin with one of these surprising examples.

In [73]:
# create two Series. One with the default index starting from 0.
# the other starting at 1
s1 = pd.Series(np.arange(10))
s2 = pd.Series(np.arange(10), index = np.arange(1, 11))

In [74]:
s1

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [75]:
s2

1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int64

In [76]:
# These series only differ by 1 in their index and have the same values
# first lets just add the values (which are just the original numpy arrays)
s1.values + s2.values

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

### That made sense. Let's see what happens when we add the series together

In [77]:
# wtf?
s1 + s2

0      NaN
1      1.0
2      3.0
3      5.0
4      7.0
5      9.0
6     11.0
7     13.0
8     15.0
9     17.0
10     NaN
dtype: float64

### What happened?
A couple of things look wrong. Not only are the values one off from the original numpy array addition, there are also two NaNs (not a number, also known as missing values). This example is fundamental to understanding many pandas operations. The Series **automatically aligned** on their index (not by integer position). Index 1 in `s1` got aligned with index 1 in `s2`, which produced the operation 1 + 0. Only the indexes 1 - 9 got paired up. Index 0 from `s1` and index 10 from `s2` had no partner in the other Series, so each appears in the result with a NaN value. This is similar to a SQL outer join.

Let's look at one more example to drive the point home.

In [78]:
# The only index label in common was 'd'. The rest are missing
s1 = pd.Series(np.arange(4), index=list('abcd'))
s2 = pd.Series(np.arange(4), index=list('defg'))
s1 + s2

a    NaN
b    NaN
c    NaN
d    3.0
e    NaN
f    NaN
g    NaN
dtype: float64
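If NaN is not what you want for labels that appear in only one Series, the `add` method accepts a `fill_value` argument. The sketch below treats a missing partner as 0. Whether 0 is a sensible stand-in depends on your data, so take it as an assumption of this example.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(4), index=list('abcd'))
s2 = pd.Series(np.arange(4), index=list('defg'))

# labels present in only one Series are paired with 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```

The indexes are still aligned first. Only the handling of unmatched labels changes.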

### More surprises with indexes
There is no enforcement of uniqueness on the index, so all the elements in your Series can have the same index value. Let's look at some operations on Series when there are duplicated index labels.

In [14]:
s1 = pd.Series(np.arange(8), index=list('aaaabbbb'))
s2 = pd.Series(np.arange(5), index=list('aabbc'))
s2

a    0
a    1
b    2
b    3
c    4
dtype: int32

In [80]:
s1 + s2

a     0.0
a     1.0
a     1.0
a     2.0
a     2.0
a     3.0
a     3.0
a     4.0
b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
c     NaN
dtype: float64

In [81]:
# A bit advanced but gives us a clearer look at how the aligning is happening
df = s1.to_frame(name='s1').join(s2.to_frame('s2'), how='outer')
df['sum'] = df['s1'] + df['s2']
df

    s1  s2  sum
a  0.0   0  0.0
a  0.0   1  1.0
a  1.0   0  1.0
a  1.0   1  2.0
a  2.0   0  2.0
a  2.0   1  3.0
a  3.0   0  3.0
a  3.0   1  4.0
b  4.0   2  6.0
b  4.0   3  7.0


###  The indexes align first and then the values are summed
If there were 5 'a' indexes in the first series and 7 'a' indexes in the second series then there would be 35 total 'a' indexes after summing both series (like we did above). This behavior is very different than a numpy array which simply aligns elements by their position first and then adds them. 

Also worth mentioning is that index aligning allows you to add Series of different lengths, while one-dimensional numpy arrays of different lengths cannot be added (unless one of them has length 1 and is broadcast).
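Both points can be checked on a smaller case. In this sketch `left` and `right` are made-up names, with 2 and 3 copies of the label 'a'.

```python
import numpy as np
import pandas as pd

# 2 'a' labels on the left, 3 'a' labels on the right
left = pd.Series([1, 2], index=list('aa'))
right = pd.Series([10, 20, 30], index=list('aaa'))

# every 'a' on the left pairs with every 'a' on the right
result = left + right
print(len(result))    # 2 * 3 = 6 rows

# NumPy adds by position, so arrays of length 2 and 3 cannot be combined
numpy_failed = False
try:
    left.values + right.values
except ValueError as err:
    numpy_failed = True
    print('ValueError:', err)
```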

# End of Section Summary
* Know that pandas is built on top of numpy
* The main data structures of Pandas are the Series and the DataFrame
* The index is the main construct that separates a numpy array from a pandas DataFrame
* Always know the types of your objects so you know what they are capable of doing
* Use `%timeit` to measure how long a piece of code takes to run
* Create a Series with a custom index
* Access Series elements with loc and iloc
* Understand automatic alignment of the index
* Know basic Series methods: `value_counts`, `sum`, `max`, `min`, `head`, etc.
* Be familiar with vectorization - no for loops
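
As a reminder of what the vectorization point means in practice, here is a small sketch comparing an explicit Python loop with a single whole-Series operation. The Series and the names `looped` and `vectorized` are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=list('abcde'))

# Loop version: visit each element in Python
looped = pd.Series([value * 2 for value in s], index=s.index)

# Vectorized version: one operation on the whole Series
vectorized = s * 2

# Both approaches produce the same values
print((vectorized == looped).all())   # True
```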

# Your Turn!

### Problem 1
<span  style="color:green; font-size:16px">What type of object is returned from the values of the index of a Series?</span>

In [31]:
import pandas as pd
import numpy as np
type(pd.Series(index=list('aaabb'), data=np.random.rand(5)).index)

pandas.indexes.base.Index
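
The question can also be read as asking about the `.values` attribute of the index rather than the index itself. The sketch below prints both types. The index is a pandas `Index` object. Its `.values` is typically a NumPy array, although the exact type may depend on the pandas version and the dtype of the index, so no output is shown here.

```python
import numpy as np
import pandas as pd

s = pd.Series(index=list('aaabb'), data=np.random.rand(5))

# The index itself is a pandas Index object
print(type(s.index))

# The data held by the index, exposed through .values
print(type(s.index.values))
```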

### Problem 2
<span  style="color:green; font-size:16px">Create a 3 element pandas Series using the Series constructor with characters as the index and numbers as the values. Output the Series.</span>

In [33]:
s1 = pd.Series(index=list('abc'), data=np.random.rand(3))
s1


a    0.011222
b    0.225358
c    0.749099
dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">Another way to create a series is to pass a dictionary to the pandas series constructor. The keys of the dictionary become the Series index and the dictionary values become the Series values. Create a dictionary with at least 3 elements and use it to create a series. Output the Series.</span>

In [37]:
d1 = {'a': 5, 'b': 7, 'c': -1}
s2 = pd.Series(d1)
s2

a    5
b    7
c   -1
dtype: int64

### Using NumPy to create a Series with random values
A common way to create Series for practice is to fill them with NumPy random values. NumPy has an excellent random module that provides more functionality than the built-in Python random module for creating all sorts of random numbers from different distributions.

Below, a Series of length 100 is created with random numbers between 0 and 1 using the `np.random.rand` function. The index starts at 10.

In [17]:
idx = range(10, 110) # generate values for the index
values = np.random.rand(100) # generate 100 random numbers between 0 and 1

s = pd.Series(data=values, index=idx)
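
Because the values are random, the numbers you see will differ from the outputs shown in this notebook. If you want repeatable values, one option is a seeded generator. This is a sketch, assuming NumPy 1.17 or later for `np.random.default_rng`. The seed `0` and the name `s_seeded` are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

# Seed value 0 is arbitrary; any fixed integer makes the draws repeatable
rng = np.random.default_rng(0)

idx = range(10, 110)           # index labels 10 through 109
values = rng.random(100)       # 100 floats in the half-open interval [0, 1)

s_seeded = pd.Series(data=values, index=idx)
print(s_seeded.head())
```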

### Problem 4
<span  style="color:green; font-size:16px">Output to the screen the first 10 numbers in the Series above. Remember to only use **loc** and **iloc** when accessing Series elements.</span>

In [48]:
s.head(10).equals(s.iloc[:10])

True

In [43]:
s.head()

10    0.911373
11    0.540139
12    0.670827
13    0.415421
14    0.964603
dtype: float64

In [45]:
s.loc[10]  # select by label with .loc rather than bare brackets

0.91137340340914441

### Problem 5
<span  style="color:green; font-size:16px">Output elements with labels 40, 50 and 99 from the Series above.</span>

In [50]:
s.loc[[40,50,99]]

40    0.565047
50    0.202488
99    0.969443
dtype: float64

### Problem 6
<span  style="color:green; font-size:16px">Output the last ten elements of the Series.</span>

In [55]:
s.iloc[-10:]

100    0.289168
101    0.917693
102    0.943537
103    0.929429
104    0.448697
105    0.401452
106    0.529639
107    0.500159
108    0.840411
109    0.491916
dtype: float64

### Problem 7
<span  style="color:green; font-size:16px">Output every 15th element using slice notation.</span>

In [56]:
s.iloc[::15]

10     0.911373
25     0.878402
40     0.565047
55     0.150235
70     0.194685
85     0.265078
100    0.289168
dtype: float64

### Problem 8
<span  style="color:green; font-size:16px">Write a function that accepts a single argument. The argument will be a Series. Have the function return the difference between the largest and smallest Series value. Run your function with the Series above.</span>

In [58]:
def diff(sx):
    # Range of the Series: largest value minus smallest value
    return sx.max() - sx.min()

diff(s)

0.96676107030323688

### Problem 9
<span  style="color:green; font-size:16px">If two Series are added with no indices in common, what will be the outcome? Check your answer by coding this situation.</span>

In [60]:
s_10 = pd.Series(index=list('abc'), data=np.random.rand(3))
s_20 = pd.Series(index=list('def'), data=np.random.rand(3))
s_10.add(s_20)

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
dtype: float64

### Problem 10
<span  style="color:green; font-size:16px">What happens if the two Series from Problem 9 are subtracted, multiplied or divided instead?</span>

In [63]:
s_10.sub(s_20), s_10/s_20

(a   NaN
 b   NaN
 c   NaN
 d   NaN
 e   NaN
 f   NaN
 dtype: float64, a   NaN
 b   NaN
 c   NaN
 d   NaN
 e   NaN
 f   NaN
 dtype: float64)

### Problem 11
<span  style="color:green; font-size:16px">Create two Series that have 3 elements each and that, when added together, yield a Series with 4 elements, none of which are missing.</span>

In [67]:
s_20 = pd.Series(index=list('abc'), data=np.random.rand(3))
s_21 = pd.Series(index=list('abd'), data=np.random.rand(3))
s_20, s_21, s_20.add(s_21, fill_value=1)

(a    0.829741
 b    0.513230
 c    0.165454
 dtype: float64, a    0.576981
 b    0.597853
 d    0.762838
 dtype: float64, a    1.406721
 b    1.111083
 c    1.165454
 d    1.762838
 dtype: float64)
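
The answer above relies on the `fill_value` argument of `Series.add`. The following sketch uses fixed, made-up numbers so that the effect is easy to follow. With `fill_value=0`, a label that exists on only one side is treated as 0 on the other side instead of producing NaN.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=list('abc'))
right = pd.Series([10.0, 20.0, 30.0], index=list('abd'))

# Plain addition: 'c' and 'd' each exist on one side only, so they are NaN
print(left + right)

# fill_value=0 substitutes 0 for the missing side before adding
filled = left.add(right, fill_value=0)
print(filled)
```

A label that is missing from both Series would still come out as NaN, since there is nothing to fill against.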