# Series & DataFrame Data Types

> There are two main data types in pandas that we will use in data science. 
> - DataFrame 
> - Series

For the most thorough review of Pandas and Numpy, I suggest you buy and read Wes McKinney's [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do). In due time I will have a separate module covering the most commonly used Python libraries for data science.

## Series

A Series is probably the simplest data type we will meet in data science. It can be thought of as a single column of data with many rows. Visually it looks tall and skinny.

In [32]:
# Pay close attention to all of the modules that you need to import to follow everything.
# Note that the aliases I use for imports (pd and np) are the community's standard conventions.

import pandas as pd 
import numpy as np
from pandas import Series, DataFrame

If you're curious, we can also check which version of a package we have by reading its `__version__` attribute.

In [33]:
pd.__version__ # we are using 0.19.2 of Pandas

u'0.19.2'

In [34]:
ser = Series([1,2,3,4],index=['a','b','c','d'], name='our first Series') 
ser

a    1
b    2
c    3
d    4
Name: our first Series, dtype: int64

In [35]:
ser.index # display the index 

Index([u'a', u'b', u'c', u'd'], dtype='object')

In [36]:
ser.values # display the values 

array([1, 2, 3, 4])

### Select Value by Corresponding Non-Numeric Index or Numeric Index

In [37]:
ser['a'] # non-numeric index 

1

In [38]:
ser[0] # numerical index / note that index starts at 0 

1

In [39]:
ser.iloc[0] # iloc selects by integer position

1

In [40]:
ser.loc['a'] # loc selects by index label (here a non-numeric one)

1

In [41]:
ser.ix[0] # ix is general purpose / can take both numeric and non-numeric index

1

In [42]:
ser.ix['a'] # as mentioned ix can also take non-numeric index 

1
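
A note for readers on newer releases: `.ix` was deprecated in pandas 0.20 and removed in 1.0, so the two `ix` cells above only run on old versions such as the 0.19.2 used here. As far as I know, plain `ser[0]` on a string-labelled Series is also deprecated in recent releases. Here is a minimal sketch of the same lookups using only `loc` and `iloc`; the variable names `by_label` and `by_position` are my own.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name='our first Series')

by_label = ser.loc['a']      # select by index label
by_position = ser.iloc[0]    # select by integer position

print(by_label, by_position)
```

Being explicit about label versus position also avoids surprises when the index itself is made of integers.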

Suppose that we also want to view the index along with the value.

To do that, wrap the index in a list.

In [106]:
ser[[0]]

a    1
Name: our first Series, dtype: int64

In [107]:
ser.iloc[[0]]

a    1
Name: our first Series, dtype: int64

In [108]:
ser.loc[['a']]

a    1
Name: our first Series, dtype: int64

In [109]:
ser.ix[[0]]

a    1
Name: our first Series, dtype: int64

In [110]:
ser.ix[['a']]

a    1
Name: our first Series, dtype: int64

What we've constructed are legitimate Series. Each one holds only part of the original Series, but it is nevertheless a complete Series, so we can read attributes like index and values from it.

In [111]:
ser.ix[[0]].index

Index([u'a'], dtype='object')

In [112]:
ser.ix[[0]].values

array([1])
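
To make the difference concrete, here is a small sketch, assuming the same Series as above, that compares a bare label with a one-item list. The names `scalar` and `sub` are just mine for illustration.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name='our first Series')

scalar = ser.loc['a']    # a bare label returns the value itself
sub = ser.loc[['a']]     # a list of labels returns a (shorter) Series

print(type(sub).__name__)
print(list(sub.index), list(sub.values), sub.name)
```

The one-element result keeps the index label and the name of the original Series, which is why `index` and `values` still work on it.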

### Boolean Indexing

In [103]:
ser[[True, False, False, False]] # we can also provide a Boolean index (True means we want the value at that position)

a    1
Name: our first Series, dtype: int64
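
A quick sketch of the same idea with a numpy boolean array instead of a plain list. The mask needs one flag per element of the Series; `mask`, `picked` and `rest` are names I made up for this example.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

mask = np.array([True, False, True, False])  # one flag per element
picked = ser[mask]                           # keeps the rows flagged True

# ~ flips every flag, so this selects everything the mask skipped
rest = ser[~mask]

print(picked)
print(rest)
```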

### Select Multiple Values by Corresponding Non-Numeric Index or Numeric Index

In [75]:
ser[['a','c']] # just pass a list of the index labels that we want

a    1
c    3
Name: our first Series, dtype: int64

In [76]:
ser[[0,1]]

a    1
b    2
Name: our first Series, dtype: int64

In [77]:
ser.iloc[[0,1]]

a    1
b    2
Name: our first Series, dtype: int64

In [78]:
ser.loc[['a','b']]

a    1
b    2
Name: our first Series, dtype: int64

In [79]:
ser.ix[[0,1]]

a    1
b    2
Name: our first Series, dtype: int64

In [80]:
ser.ix[['a','b']]

a    1
b    2
Name: our first Series, dtype: int64

In [114]:
ser[[True, True, False, False]] # boolean indexing 

a    1
b    2
Name: our first Series, dtype: int64

Note that when we select multiple values by passing a list of indexes, a Series is returned automatically. If we just want the values, use the values attribute. If we just want the index, use index.

In [115]:
ser[[True, True, False, False]].values

array([1, 2])

In [116]:
ser[[True, True, False, False]].index

Index([u'a', u'b'], dtype='object')

### Boolean Logic

Note from above that to select specific value(s) of a Series we just need an index (either numeric or non-numeric).

Suppose that we only want part of the Series that has values greater than 2. Then we want the index corresponding to values greater than 2. 

In [117]:
ser.values 

array([1, 2, 3, 4])

In [118]:
ser.values > 2 # x > 2 is applied to each item x in our array and we get an array of booleans back

array([False, False,  True,  True], dtype=bool)

Remember we said that we can use boolean indexing to return certain values of interest to us.

In [119]:
ser[ser.values > 2 ]

c    3
d    4
Name: our first Series, dtype: int64

In [120]:
ser.loc[ser.values > 2 ]

c    3
d    4
Name: our first Series, dtype: int64

In [121]:
ser.ix[ser.values > 2 ]

c    3
d    4
Name: our first Series, dtype: int64

In [122]:
ser.iloc[ser.values > 2 ]

c    3
d    4
Name: our first Series, dtype: int64

It turns out that we can pass an entire Series of boolean values as the index, and the Series will understand what we are trying to do. For example...

In [137]:
ser > 2

a    False
b    False
c     True
d     True
Name: our first Series, dtype: bool

Contrast the output of "ser > 2" with the output of "ser.values > 2".

In [138]:
ser.values > 2

array([False, False,  True,  True], dtype=bool)

In [136]:
ser[ser > 2]

c    3
d    4
Name: our first Series, dtype: int64
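
The contrast between the two masks can be sketched as follows, assuming the same Series as above. `label_mask` and `array_mask` are names of my own choosing.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name='our first Series')

label_mask = ser > 2          # a Series: keeps the index and the name
array_mask = ser.values > 2   # a numpy array: just the booleans, by position

print(type(label_mask).__name__, type(array_mask).__name__)

# Both masks pick out the same rows here
same = ser[label_mask].equals(ser[array_mask])
print(same)
```

In everyday use I would reach for `ser > 2`, since it is shorter and carries the index along with it.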

## DataFrame

DataFrames can intuitively be thought of as many Series stacked horizontally. Hence a DataFrame will have multiple columns which warrant column names. Similar to Series, a DataFrame will have an index that can help us identify rows of data that are of interest to us.

In [123]:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
rownames = ['row1','row2', 'row3']
colnames = ['col1','col2','col3']
df = DataFrame(arr,index=rownames,columns=colnames)
df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row2 | 4    | 5    | 6    |
| row3 | 7    | 8    | 9    |
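
To illustrate the "Series stacked horizontally" picture, here is a hedged sketch that builds the same table from three named Series with `pd.concat`. This is just one way to do it, and the names `rows`, `col1`, `col2`, `col3` and `stacked` are mine.

```python
import pandas as pd

rows = ['row1', 'row2', 'row3']
col1 = pd.Series([1, 4, 7], index=rows, name='col1')
col2 = pd.Series([2, 5, 8], index=rows, name='col2')
col3 = pd.Series([3, 6, 9], index=rows, name='col3')

# axis=1 places the Series side by side; each Series name becomes a column label
stacked = pd.concat([col1, col2, col3], axis=1)
print(stacked)
```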


### Select Values by Column 

In [44]:
df['col1'] # specify the column name with brackets

row1    1
row2    4
row3    7
Name: col1, dtype: int64

In [45]:
df.col1 # attribute access also works, as long as the column name is a valid Python identifier

row1    1
row2    4
row3    7
Name: col1, dtype: int64

### Selecting Values by Multiple Columns

In [46]:
# simply specify a list of column names 
df[['col1','col3']]

|      | col1 | col3 |
|------|------|------|
| row1 | 1    | 3    |
| row2 | 4    | 6    |
| row3 | 7    | 9    |


We can also specify a list of column positions. If we want to get col1 and col3, the corresponding numeric indexes are 0 and 2.

In [81]:
df[[0,2]] # specify numeric index 

|      | col1 | col3 |
|------|------|------|
| row1 | 1    | 3    |
| row2 | 4    | 6    |
| row3 | 7    | 9    |
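
My understanding is that `df[[0,2]]` only works this way on old pandas versions like the one used here; on recent releases, integers inside the brackets are looked up as column labels and fail when the labels are strings. A minimal sketch of selecting columns by position with `iloc`, which should work on either (variable names are my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2', 'col3'])

# ":" means all rows; [0, 2] picks the first and third columns by position
by_position = df.iloc[:, [0, 2]]
by_label = df[['col1', 'col3']]

print(by_position)
```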


### Select Values by Row

For selecting DataFrame values by row think about how we searched values by index for Series: loc, iloc and ix.

In [50]:
df.loc['row1'] # loc takes in non-numerical index label

col1    1
col2    2
col3    3
Name: row1, dtype: int64

In [51]:
df.iloc[0] # iloc takes in an integer position

col1    1
col2    2
col3    3
Name: row1, dtype: int64

In [52]:
df.ix['row1'] # ix can take in either a numerical or a non-numerical index

col1    1
col2    2
col3    3
Name: row1, dtype: int64

In [53]:
df.ix[0] # ix can take in either a numerical or a non-numerical index

col1    1
col2    2
col3    3
Name: row1, dtype: int64

### Selecting Values by Multiple Rows

In [54]:
df.ix[['row1','row3']] # simply specify a list of row names

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row3 | 7    | 8    | 9    |


In [55]:
df.ix[[0,2]] 

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row3 | 7    | 8    | 9    |
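
Since `ix` is gone from pandas 1.0 onward, here is a sketch of the same two multi-row selections written with `loc` and `iloc`. The names `rows_by_label` and `rows_by_position` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2', 'col3'])

rows_by_label = df.loc[['row1', 'row3']]   # list of row labels
rows_by_position = df.iloc[[0, 2]]         # list of integer positions

print(rows_by_label)
```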


### Relationship Between Series and DataFrame

Note that the relationship between a DataFrame and a Series is intimate. The object returned by selecting a single column or a single row is a Series. In fact, a DataFrame is just many Series stacked together, either horizontally or vertically.

Let's check using the built-in type function, which returns the type of an object.

In [24]:
type(df)

pandas.core.frame.DataFrame

In [56]:
df['col1'] # select a column of the DataFrame

row1    1
row2    4
row3    7
Name: col1, dtype: int64

In [57]:
type(df['col1'])

pandas.core.series.Series

Another way to think about a Series is as a single row or column of a DataFrame.

In [59]:
df.ix['row1']

col1    1
col2    2
col3    3
Name: row1, dtype: int64

In [60]:
type(df.ix['row1'])

pandas.core.series.Series
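
One more way to see the relationship, sketched with `loc` so it also runs on newer pandas: the Series we get back carries the labels from the other axis. The names `row` and `col` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2', 'col3'])

row = df.loc['row1']   # a row comes back as a Series
col = df['col1']       # so does a column

# The row's index is the DataFrame's column labels, and vice versa
print(list(row.index), row.name)
print(list(col.index), col.name)
```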

### Selecting Value by Row and Column

We're actually going to be using the same method as selecting rows, that is ix, iloc and loc.

Instead of passing just a row index, we're also going to specify columns that we want. 

In [62]:
df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row2 | 4    | 5    | 6    |
| row3 | 7    | 8    | 9    |


We have to be mindful that "loc" takes index labels, not integer positions. "ix" can take either. "iloc" can only take integer positions.

In [63]:
df.loc['row1', 'col1'] # find me the entry that is located at row1 and col1

1

In [64]:
df.ix['row1', 'col1'] 

1

In [68]:
df.iloc['row1','col1'] # this will return an error because REMEMBER iloc only accepts numeric index

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [69]:
df.iloc[0,0] # find me the entry located at the 0th row and 0th column, that is row1 x col1

1

And of course we can generalize to specifying multiple rows and columns. Just pass in a list of rows and a list of columns in that order.

In [82]:
df.loc[['row1','row3'],['col1','col3']]

|      | col1 | col3 |
|------|------|------|
| row1 | 1    | 3    |
| row3 | 7    | 9    |
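
The `iloc` error message earlier mentions integer slices whose end point is excluded. Here is a small sketch of slicing rows and columns together. Label slices with `loc` include both end points, which is easy to trip over; the variable names are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2', 'col3'])

by_label = df.loc['row1':'row2', 'col1':'col2']   # both end labels are included
by_position = df.iloc[0:2, 0:2]                   # end position is excluded

print(by_label)
```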


### Using Boolean Logic on DataFrames

In [130]:
df['col1'] > 2

row1    False
row2     True
row3     True
Name: col1, dtype: bool

In [131]:
df['col3'] > 2

row1    True
row2    True
row3    True
Name: col3, dtype: bool

In [133]:
(df['col1'] > 2) & (df['col3'] > 2) # & denotes AND 

row1    False
row2     True
row3     True
dtype: bool
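
Just like with a Series, the boolean result can be handed back to the DataFrame to keep only the matching rows. A minimal sketch, with `mask`, `filtered` and `others` as names I picked:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=['row1', 'row2', 'row3'],
                  columns=['col1', 'col2', 'col3'])

mask = (df['col1'] > 2) & (df['col3'] > 2)   # & denotes AND
filtered = df[mask]                          # rows where the mask is True
others = df[~mask]                           # ~ denotes NOT: the remaining rows

print(filtered)
print(others)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>` in Python.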

### There's so much more.

There are so many cool things we can do with Series and DataFrames. I'm saddened I can't cover them all here. The specific "cool" things we can do with Pandas and Numpy will be covered in a separate module.

Luckily, Pandas is intuitive to use.

We're going to be seeing Series and DataFrames all the time in data science. Familiarity with the different data types and their methods is going to be important moving forward.

# We're Done!

This tutorial closely follows my Medium blog [@dhexonian](http://medium.com/@dhexonian).

If you have any questions or requests please Tweet those to me, also [@dhexonian](https://twitter.com/dhexonian) 