## Pandas Series and Dataframes

**Pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame.

## Series
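A pandas Series is a one-dimensional array-like object that pairs an array of values with an associated array of labels, called its index. For the numeric data used here, the values are stored as a NumPy array: if you check the type of a Series object's `values`, you will see that it is indeed `numpy.ndarray`.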

You can assign a name to a pandas Series.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')

In [3]:
ob = pd.Series([8,7,6,5], name='test_data')
print('Name: ',ob.name)
print('Data:\n',ob)
print('Type of Object: ',type(ob))
print('Type of elements:',type(ob.values))

Name:  test_data
Data:
 0    8
1    7
2    6
3    5
Name: test_data, dtype: int64
Type of Object:  <class 'pandas.core.series.Series'>
Type of elements: <class 'numpy.ndarray'>


You can also convert an existing NumPy array to a Series.

In [4]:
# integers between 5 to 8 (reversed)
ob = pd.Series(np.linspace(5, 8, num=4, dtype=int)[::-1])
print(ob)
print(type(ob))

0    8
1    7
2    6
3    5
dtype: int64
<class 'pandas.core.series.Series'>


You can also provide a custom index for the values and then access them by label, much as you would access NumPy array elements by position.

In [5]:
ob = pd.Series([8,7,6,5], index=['a','b','c','d'])
print(ob['b'])

7


A pandas Series behaves much like a fixed-size dictionary: the mapping from index to value is preserved when array operations are applied to it. For example,

In [6]:
# select all the values greater than 4 and less than 8
print(ob[(ob>4) & (ob<8)])

b    7
c    6
d    5
dtype: int64


This also means that if you have a dictionary, you can easily convert it into a pandas Series.

In [7]:
states_dict = {'State1': 'Alabama', 
               'State2': 'California', 
               'State3': 'New Jersey', 
               'State4': 'New York'}
ob = pd.Series(states_dict)
print(ob)
print(type(ob))

State1       Alabama
State2    California
State3    New Jersey
State4      New York
dtype: object
<class 'pandas.core.series.Series'>


Unlike a dictionary, a Series lets you replace all of its index labels in place by assigning to the `index` attribute.

In [8]:
ob.index = ['AL','CA','NJ','NY']
print(ob)

AL       Alabama
CA    California
NJ    New Jersey
NY      New York
dtype: object


Or use the dictionary-style `get` method to look up a value by label, returning a default when the label is missing.

In [9]:
ob.get('CA', np.nan)

'California'
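
To make the role of the default argument concrete, here is a minimal sketch. It rebuilds the state Series under the assumed name `states` so that it runs on its own; the label `'TX'` is a hypothetical label that is not in the index.

```python
import numpy as np
import pandas as pd

# Same state Series as above, rebuilt so the snippet is self-contained
states = pd.Series(['Alabama', 'California', 'New Jersey', 'New York'],
                   index=['AL', 'CA', 'NJ', 'NY'])

found = states.get('CA', np.nan)     # label exists: its value is returned
missing = states.get('TX', np.nan)   # label absent: the default is returned

print(found)
print(missing)
print('TX' in states)   # membership tests check the index, like dict keys
```

Using `get` avoids the `KeyError` that `states['TX']` would raise.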

## Dataframe
A DataFrame is similar to a spreadsheet or a SQL table. It is a two-dimensional labelled data structure whose columns can have different data types. Like Series, DataFrame accepts many different kinds of input:
* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* [`Structured or record ndarray`](http://docs.scipy.org/doc/numpy/user/basics.rec.html 'Structured or record ndarray')
* A Series
* Another DataFrame

Compared with other DataFrame-like structures you may have used before (like R’s `data.frame`), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.


### Creating Dataframes from dictionaries

In [10]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [11]:
df = pd.DataFrame(data)
print('Dataframe:\n',df)
print('Type of Object:',type(df))
print('Type of elements:',type(df.values))

Dataframe:
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
Type of Object: <class 'pandas.core.frame.DataFrame'>
Type of elements: <class 'numpy.ndarray'>


Another way to construct a DataFrame from dictionaries is the `DataFrame.from_dict` function. `DataFrame.from_dict` takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the `orient` parameter, which is `'columns'` by default but can be set to `'index'` in order to use the dict keys as row labels.
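
The effect of `orient` can be sketched as follows. The dictionary `raw` and the result names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two keys, each mapping to an equal-length list
raw = {'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]}

# Default orient='columns': dict keys become column labels
by_columns = pd.DataFrame.from_dict(raw)

# orient='index': dict keys become row labels instead
by_index = pd.DataFrame.from_dict(raw, orient='index')

print(by_columns)
print(by_index)
```

With `orient='index'` the result has one row per key, so it is the transpose of the default layout.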

Just like Series, you can access index, values and also columns.

In [12]:
print('Index: ',df.index)
print('Columns: ',df.columns)
print('Values of Column one: ',df['one'].values)
print('Values of Column two: ',df['two'].values)

Index:  Index(['a', 'b', 'c', 'd'], dtype='object')
Columns:  Index(['one', 'two'], dtype='object')
Values of Column one:  [ 1.  2.  3. nan]
Values of Column two:  [1. 2. 3. 4.]


### Creating dataframe from list of dictionaries

As with Series, if a key is missing from one of the dictionaries (or its value is `None`), the corresponding cell appears as NaN in the result.

In [13]:
df2 = pd.DataFrame([{'a': 1, 'b': 2, 'c':3, 'd':None}, 
                    {'a': 2, 'b': 2, 'c': 3, 'd': 4}],
                   index=['one', 'two'])
print('Dataframe: \n',df2)

# Of course, you can also transpose the result:
print('Transposed Dataframe: \n',df2.T)

Dataframe: 
      a  b  c    d
one  1  2  3  NaN
two  2  2  3  4.0
Transposed Dataframe: 
    one  two
a  1.0  2.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


Assigning a column that doesn’t exist will create a new column. 

In [14]:
df['three'] = None
print('Added third column: \n',df)

# The del keyword can be used to delete columns:
del df['three']
print('\nDeleted third column: \n',df)
# You can also use df.drop(). We shall see that later

Added third column: 
    one  two three
a  1.0  1.0  None
b  2.0  2.0  None
c  3.0  3.0  None
d  NaN  4.0  None

Deleted third column: 
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


Each Index has a number of methods and properties for set logic and answering other common questions about the data it contains.


|Method | Description|
|:---|:---|
|`append` | Concatenate with additional Index objects, producing a new Index|
|`difference` | Compute set difference as an Index|
|`intersection` | Compute set intersection|
|`union` | Compute set union|
|`isin` | Compute boolean array indicating whether each value is contained in the passed collection|
|`delete` | Compute new Index with element at index i deleted|
|`drop` | Compute new index by deleting passed values|
|`insert` | Compute new Index by inserting element at index i|
|`is_monotonic_increasing` | Returns True if each element is greater than or equal to the previous element (older pandas versions also exposed this as `is_monotonic`)|
|`is_unique` | Returns True if the Index has no duplicate values|
|`unique` | Compute the array of unique values in the Index|

For example:

In [15]:
print(1 in df.one.values)
print('one' in df.columns)

True
True
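
The set-logic methods listed in the table above can be sketched on two small, made-up Index objects (`left` and `right` are illustrative names). Note that `difference` is the current name for the set-difference method.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c', 'd'])
right = pd.Index(['c', 'd', 'e'])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels in left but not in right
mask = left.isin(['a', 'd'])         # boolean NumPy array, one entry per label

print(common, combined, only_left, mask, sep='\n')
print(left.is_unique)
```

Each of these methods returns a new object; an Index itself is immutable.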


## Reindexing
A critical method on Pandas objects is reindex, which means to create a new object with the data conformed to a new index.

The following is how you might reindex.

In [16]:
data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [17]:
# Reindex in descending order.
print(df.reindex(['d','c','b','a']))

   one  two
d  NaN  4.0
c  3.0  3.0
b  2.0  2.0
a  1.0  1.0


If you `reindex` with labels that are not in the DataFrame, the result contains new rows for those labels, filled with `NaN`.

In [18]:
print(df.reindex(['a','b','c','d','e']))

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  NaN


Reindexing is also useful when you want to supply a value for the missing entries it introduces, using `fill_value`. In our case, pay attention to column `one`, row `d`:

In [19]:
df.reindex(['a','b','c','d','e'], fill_value=0)
# Guess why the df['one']['d'] was not filled with 0 ?

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  0.0  0.0
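
The behaviour of `fill_value` can be checked with a short sketch. It rebuilds `df` so that it runs standalone, and it uses `fillna` as one possible way (an addition, not part of the original walkthrough) to also replace the `NaN` that was already there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
                   'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])})

# fill_value only applies to labels that reindex introduces (row 'e' here)
reindexed = df.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

# fillna replaces every NaN, including the pre-existing one at ('d', 'one')
filled = reindexed.fillna(0)

print(reindexed)
print(filled)
```

The cell at row `d`, column `one` was already `NaN` before reindexing, so `fill_value` leaves it alone.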


For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The `method` option allows us to do this, using a method such as `ffill`, which forward-fills the values (the existing index must be sorted for this to work):

In [20]:
df.reindex(['a','b','c','d','e'], method='ffill')

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  4.0


The two basic `method` (interpolation) options are:

|Method | Description|
|:---|:---|
|`ffill` or `pad` | Fill (or carry) values forward |
|`bfill` or `backfill` | Fill (or carry) values backward|

`reindex` accepts the following arguments:

|Argument | Description|
|:---|:---|
|`index` | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying|
|`method` | Interpolation (fill) method, see above table for options.|
|`fill_value` | Substitute value to use when introducing missing data by reindexing.|
|`limit` | When forward- or backfilling, maximum size gap to fill|
|`level` | Match simple Index on level of MultiIndex, otherwise select subset of|
|`copy` | If `True` (the default), always copy the underlying data. If `False`, do not copy the data when the new index is equivalent to the old index|
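
As a rough illustration of how `method` and `limit` interact, the sketch below uses a hypothetical Series `s` whose sorted integer index has gaps.

```python
import pandas as pd

# Hypothetical Series with a sorted integer index that has gaps
s = pd.Series(['low', 'mid', 'high'], index=[0, 3, 6])

# Forward fill: each new label takes the value of the closest earlier label
ffilled = s.reindex(range(8), method='ffill')

# limit=1 fills at most one consecutive new label after each known value
limited = s.reindex(range(8), method='ffill', limit=1)

print(ffilled)
print(limited)
```

With `limit=1`, labels 2 and 5 stay missing because they are the second new label after a known value.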

## Dropping Entries
Dropping one or more entries from an axis is easy with the `drop` method, which takes a label or list of labels and returns a new object without those entries.

In [21]:
# Drop row c and row a
df.drop(['c', 'a'])

   one  two
b  2.0  2.0
d  NaN  4.0


In [22]:
# And to drop column two try this
df.drop(['two'], axis=1)

   one
a  1.0
b  2.0
c  3.0
d  NaN
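
A small sketch of the same idea using the `index=` and `columns=` keywords, which are an alternative spelling of the `axis` argument. The frame here is a made-up one so the snippet runs standalone.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                  index=['a', 'b', 'c'])

without_row = df.drop(index=['a'])       # equivalent to df.drop(['a'])
without_col = df.drop(columns=['two'])   # equivalent to df.drop(['two'], axis=1)

# df itself is unchanged because drop returns a new object
print(df.shape, without_row.shape, without_col.shape)
```

To keep the result, assign it to a name as done here.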


## Indexing, selection, Sorting and filtering
Series indexing works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.

To select only column `one` for rows `a` and `d` (the first and last rows), use the following.

In [23]:
print("Dataframe: \n",df)
# Selecting only column `one` for row `a` and row `d`
df['one'][['a', 'd']]

Dataframe: 
    one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


a    1.0
d    NaN
Name: one, dtype: float64

In [24]:
# Slicing df from row `b` to row `d` (inclusive) for column `one`
df['one']['b':'d']

b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

If you observe the above command (and the one above it), you will see that slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.

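For label-based indexing on the rows and columns of a DataFrame, use the special indexing field `loc` (the older `ix` indexer is deprecated and has been removed from recent pandas releases). It enables you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels, and is a less verbose way to do the reindexing. For positional indexing, use `iloc`.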

In [26]:
df.loc[['a', 'c'], ['one']]

   one
a  1.0
c  3.0


In [27]:
# Boolean mask: select the rows where column `one` is greater than 1
df.loc[df.one > 1]

   one  two
b  2.0  2.0
c  3.0  3.0


There are many ways to select and rearrange the data contained in a pandas object. Some indexing options can be seen in below table:

|Indexing Type| Description|
|:---|:---|
|`df[val]` | Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).|
|`df.loc[val]` | Select a single row or subset of rows from the DataFrame by label.|
|`df.loc[:, val]` | Select a single column or subset of columns by label.|
|`df.loc[val1, val2]` | Select both rows and columns by label.|
|`reindex` method | Conform one or more axes to new indexes.|
|`xs` method | Select single row or column as a Series by label.|
|`df.iloc[i]`, `df.iloc[:, j]` | Select single row or column, respectively, as a Series by integer location (these replace the older `irow` and `icol` methods).|
|`df.at[row, col]`, `df.iat[i, j]` | Select single value by row and column label or by integer position (these replace the older `get_value` and `set_value` methods).|
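
The main indexing options can be compared side by side in a short sketch. It rebuilds the example `df`, and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
                   'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])})

by_label = df.loc['b':'d', 'two']   # label slice: endpoint 'd' is included
by_position = df.iloc[1:3, 1]       # positional slice: endpoint 3 is excluded
scalar = df.at['c', 'one']          # single value by row and column label
filtered = df[df['one'] > 1]        # boolean mask filters rows

print(by_label, by_position, scalar, filtered, sep='\n\n')
```

Label slices include both endpoints, while positional slices follow the usual Python convention.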

You can sort a data frame or series (by some criteria) using the built-in functions. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [28]:
dt = pd.Series(np.random.randint(3, 10, size=7), 
               index=['g','c','a','b','e','d','f'])
print('Original Data: \n', dt, end="\n\n")
print('Sorted by Index: \n',dt.sort_index())

Original Data: 
 g    6
c    7
a    7
b    4
e    9
d    3
f    8
dtype: int64

Sorted by Index: 
 a    7
b    4
c    7
d    3
e    9
f    8
g    6
dtype: int64
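
Sorting by value rather than by label is done with `sort_values`. The sketch below reuses the values printed above as fixed data (instead of random draws) so that the result is reproducible.

```python
import pandas as pd

# Fixed values rather than random ones, so the output is reproducible
dt = pd.Series([6, 7, 7, 4, 9, 3, 8],
               index=['g', 'c', 'a', 'b', 'e', 'd', 'f'])

by_value = dt.sort_values()                      # ascending by value
by_index_desc = dt.sort_index(ascending=False)   # descending by label

print(by_value)
print(by_index_desc)
```

Both methods return a new, sorted object and leave `dt` untouched.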


## Data alignment and arithmetic
Arithmetic between DataFrame objects automatically aligns on both the columns and the index (row labels). The resulting object has the union of the column and row labels.

In [29]:
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print('df1:\n',df1, end="\n\n")
print('df2:\n',df2, end="\n\n")
print('Sum:\n',df1.add(df2))

df1:
           A         B         C         D
0 -1.027802  0.071980  1.083831  0.117767
1  0.928299 -0.975015 -0.256632 -1.022280
2  1.143273 -0.233660 -0.022156  2.083412
3 -0.359587 -0.823477  0.363769  0.449445
4 -0.862179 -0.110148 -1.419463  1.470670
5  0.993296  0.390077  0.597697  0.669840
6 -0.388123  0.274498  0.081150 -0.586702
7  0.193284 -0.385378 -1.711573  0.593487
8 -0.013713  0.201136 -0.813772  0.898498
9  1.709947  0.272814 -1.435655 -0.714309

df2:
           A         B         C
0 -1.382737 -0.206609  1.260968
1  0.023300  0.079133  1.065827
2 -1.708856  1.166897 -0.225432
3  1.585032 -0.499857  1.794270
4 -0.572394 -0.086707 -0.060390
5  1.192763  0.101969  4.061451
6 -0.315777 -0.131411 -1.806998

Sum:
           A         B         C   D
0 -2.410540 -0.134629  2.344798 NaN
1  0.951599 -0.895882  0.809196 NaN
2 -0.565583  0.933236 -0.247588 NaN
3  1.225444 -1.323334  2.158039 NaN
4 -1.434573 -0.196855 -1.479853 NaN
5  2.186058  0.492046  4.659148 NaN
6 -0.703899  0.143088 -1.725848 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

Note that in arithmetic operations between differently-indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [30]:
print('Sum:\n',df1.add(df2, fill_value=0))

Sum:
           A         B         C         D
0 -2.410540 -0.134629  2.344798  0.117767
1  0.951599 -0.895882  0.809196 -1.022280
2 -0.565583  0.933236 -0.247588  2.083412
3  1.225444 -1.323334  2.158039  0.449445
4 -1.434573 -0.196855 -1.479853  1.470670
5  2.186058  0.492046  4.659148  0.669840
6 -0.703899  0.143088 -1.725848 -0.586702
7  0.193284 -0.385378 -1.711573  0.593487
8 -0.013713  0.201136 -0.813772  0.898498
9  1.709947  0.272814 -1.435655 -0.714309


Similarly, you can perform subtraction, multiplication and division with the `sub`, `mul` and `div` methods.
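
As a minimal sketch of one of these operations, the following multiplies two small, made-up frames `x` and `y` with and without `fill_value`.

```python
import pandas as pd

x = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.]})
y = pd.DataFrame({'A': [10.]})   # a single row and a single column

plain = x.mul(y)                  # cells missing from y produce NaN
filled = x.mul(y, fill_value=1)   # a missing operand is treated as 1

print(plain)
print(filled)
```

A neutral `fill_value` (0 for addition, 1 for multiplication) keeps the values of the larger frame where the smaller one has no data.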

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting (just like in numpy) row-wise.

In [31]:
print("Dataframe: \n", df1, end="\n\n")
print("Operand (0th row): \n", df1.loc[0], end="\n\n")
print('Subtraction: \n',df1.sub(df1.loc[0]))

Dataframe: 
           A         B         C         D
0 -1.027802  0.071980  1.083831  0.117767
1  0.928299 -0.975015 -0.256632 -1.022280
2  1.143273 -0.233660 -0.022156  2.083412
3 -0.359587 -0.823477  0.363769  0.449445
4 -0.862179 -0.110148 -1.419463  1.470670
5  0.993296  0.390077  0.597697  0.669840
6 -0.388123  0.274498  0.081150 -0.586702
7  0.193284 -0.385378 -1.711573  0.593487
8 -0.013713  0.201136 -0.813772  0.898498
9  1.709947  0.272814 -1.435655 -0.714309

Operand (0th row): 
 A   -1.027802
B    0.071980
C    1.083831
D    0.117767
Name: 0, dtype: float64

Subtraction: 
           A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1  1.956102 -1.046995 -1.340462 -1.140046
2  2.171075 -0.305640 -1.105987  1.965645
3  0.668215 -0.895457 -0.720062  0.331678
4  0.165623 -0.182128 -2.503294  1.352903
5  2.021098  0.318097 -0.486134  0.552073
6  0.639680  0.202518 -1.002681 -0.704469
7  1.221086 -0.457358 -2.795403  0.475720
8  1.014089  0.129156 -1.897603  0.780731
9  2.737749  0.200834 -2.519486 -0.832076

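To broadcast column-wise instead, matching the Series index against the DataFrame's row index (common with time series data, where the index contains dates), pass `axis=0` (or `axis='index'`) to the arithmetic method. Older pandas versions did this automatically for date indexes, but current versions require the explicit `axis`. First, give `df1` a date index; note that `set_index` returns a new DataFrame and leaves `df1` unchanged: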

In [32]:
ind1 = pd.date_range('06/1/2017', periods=10)
df1.set_index(ind1)

                   A         B         C         D
2017-06-01 -1.027802  0.071980  1.083831  0.117767
2017-06-02  0.928299 -0.975015 -0.256632 -1.022280
2017-06-03  1.143273 -0.233660 -0.022156  2.083412
2017-06-04 -0.359587 -0.823477  0.363769  0.449445
2017-06-05 -0.862179 -0.110148 -1.419463  1.470670
2017-06-06  0.993296  0.390077  0.597697  0.669840
2017-06-07 -0.388123  0.274498  0.081150 -0.586702
2017-06-08  0.193284 -0.385378 -1.711573  0.593487
2017-06-09 -0.013713  0.201136 -0.813772  0.898498
2017-06-10  1.709947  0.272814 -1.435655 -0.714309
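
Column-wise broadcasting on a date-indexed frame can be sketched with an explicit `axis=0`. The small frame `frame` below is made up for the example.

```python
import pandas as pd

dates = pd.date_range('2017-06-01', periods=3)
frame = pd.DataFrame({'A': [1., 2., 3.], 'B': [10., 20., 30.]}, index=dates)

# Subtract column 'A' from every column, matching on the row (date) index
centered = frame.sub(frame['A'], axis=0)

print(centered)
```

Without `axis=0`, pandas would try to match the dates in `frame['A']` against the column labels `A` and `B`, and no labels would line up.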


## Using Numpy functions on DataFrame
Element-wise NumPy `ufuncs` like `log`, `exp`, and `sqrt`, as well as various other NumPy functions, can be used on a DataFrame.

In [33]:
np.abs(df1)

          A         B         C         D
0  1.027802  0.071980  1.083831  0.117767
1  0.928299  0.975015  0.256632  1.022280
2  1.143273  0.233660  0.022156  2.083412
3  0.359587  0.823477  0.363769  0.449445
4  0.862179  0.110148  1.419463  1.470670
5  0.993296  0.390077  0.597697  0.669840
6  0.388123  0.274498  0.081150  0.586702
7  0.193284  0.385378  1.711573  0.593487
8  0.013713  0.201136  0.813772  0.898498
9  1.709947  0.272814  1.435655  0.714309


In [34]:
# Convert to numpy array
np.asarray(df1)

array([[-1.02780241,  0.07197991,  1.08383075,  0.11776666],
       [ 0.92829924, -0.97501521, -0.2566316 , -1.02227976],
       [ 1.14327256, -0.23366037, -0.02215588,  2.08341213],
       [-0.35958715, -0.82347742,  0.36376864,  0.44944499],
       [-0.86217913, -0.11014768, -1.41946308,  1.47066962],
       [ 0.99329586,  0.39007695,  0.59769692,  0.66983982],
       [-0.38812253,  0.27449811,  0.08114999, -0.58670202],
       [ 0.19328368, -0.38537794, -1.71157274,  0.59348704],
       [-0.01371317,  0.20113613, -0.813772  ,  0.89849793],
       [ 1.70994656,  0.2728144 , -1.43565488, -0.71430899]])

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s `apply` method does exactly this:

In [35]:
def fn(x):
    """
    Get max and min of the columns
    """
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

df1.apply(fn)

            A         B         C         D
min -1.027802 -0.975015 -1.711573 -1.022280
max  1.709947  0.390077  1.083831  2.083412
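
By default `apply` passes each column to the function. A brief sketch of applying a function to each row instead, using `axis=1` on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1., 5.], 'B': [4., 2.]}, index=['r0', 'r1'])

# Default axis=0: the function receives each column as a Series
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```

The result is indexed by column labels in the first case and by row labels in the second.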


Element-wise Python functions can be used, too. Suppose you want to format each element of the DataFrame as a floating-point string with 3 decimal places. You can do this with `applymap` (in pandas 2.1 and later, `applymap` is deprecated in favour of the equivalent `DataFrame.map`):

In [36]:
fmt = lambda x: "{:.3f}".format(x)
df1.applymap(fmt)

        A       B       C       D
0  -1.028   0.072   1.084   0.118
1   0.928  -0.975  -0.257  -1.022
2   1.143  -0.234  -0.022   2.083
3  -0.360  -0.823   0.364   0.449
4  -0.862  -0.110  -1.419   1.471
5   0.993   0.390   0.598   0.670
6  -0.388   0.274   0.081  -0.587
7   0.193  -0.385  -1.712   0.593
8  -0.014   0.201  -0.814   0.898
9   1.710   0.273  -1.436  -0.714


The reason for the name `applymap` on DataFrame (instead of simply `map`) is that pandas Series already has a `map` method for applying an element-wise operation.
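
For comparison, a minimal sketch of the Series `map` method applying the same formatting to a small made-up Series:

```python
import pandas as pd

s = pd.Series([1.23456, -0.5, 2.0], index=['a', 'b', 'c'])

# Apply the formatting function to each element of the Series
formatted = s.map(lambda x: "{:.3f}".format(x))

print(formatted)
```

The result holds strings rather than floats, so it is suited to display rather than further arithmetic.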