Pandas Series Model
--------------------------
Example of series Data
Prepared by abhijeet mote
abhijeetmote@gmail.com

In [None]:
import pandas as pd
import numpy as np

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call pd.Series(data, index=index).

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [None]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [None]:
s

In [None]:
s.index

In [None]:
pd.Series(np.random.randn(5))
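
The length rule described above can be checked with a small sketch. The names `values`, `default_indexed`, and `length_error` are illustrative and do not come from the notebook:

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])  # hypothetical data

# No index passed: pandas builds the default integer index 0..len(values)-1.
default_indexed = pd.Series(values)
print(list(default_indexed.index))

# An index whose length differs from the data is rejected with a ValueError.
length_error = None
try:
    pd.Series(values, index=['a', 'b'])
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)
```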

From dict
------------
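If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the keys of the dict: in insertion order on recent pandas versions (0.23 and later on Python 3.6+), and from the sorted keys, if possible, on older versions.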


In [None]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [None]:
pd.Series(d)

In [None]:
pd.Series(d, index=['b', 'c', 'd', 'a'])
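
A label that is requested in the index but missing from the dict ends up as NaN. A minimal sketch, using made-up data in a variable called `prices`:

```python
import pandas as pd

prices = {'a': 0.0, 'b': 1.0, 'c': 2.0}  # hypothetical data
selected = pd.Series(prices, index=['b', 'c', 'd', 'a'])

# 'd' is not a key of the dict, so its value is NaN (missing).
print(selected)
print(selected.isna().tolist())
```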

### From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [None]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

Series is ndarray-like
-------------------------
Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

In [None]:
# Positional access; plain s[0] is deprecated on a label-indexed Series.
s.iloc[0]

In [None]:
s[:3]

In [None]:
s[s > s.median()]

In [None]:
s.median()

In [None]:
s.iloc[[4, 3, 1]]

In [None]:
np.exp(s)
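
To make the difference from a plain ndarray concrete, here is a short sketch with fixed values (the names `arr`, `ser`, and `doubled` are illustrative):

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0, 4.0])
ser = pd.Series(arr, index=['w', 'x', 'y', 'z'])

# Slicing the array yields bare values; slicing the Series keeps the labels.
print(arr[1:3])
print(ser.iloc[1:3])

# NumPy ufuncs applied to a Series return a Series with the same index.
doubled = np.multiply(ser, 2)
print(doubled)
```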

Series is dict-like
----------------------
A Series is like a fixed-size dict in that you can get and set values by index label:


In [None]:
s['a']

In [None]:
s['e'] = 12.

In [None]:
s

In [None]:
'e' in s

In [None]:
'f' in s

### If a label is not in the index, a KeyError is raised:

In [None]:
s['f']

Using the get method, a missing label will return None or a specified default:


In [None]:
s.get('f')
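
The optional second argument of get is the fallback value. A small sketch with an illustrative Series named `ser`:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0], index=['a', 'b'])

missing = ser.get('f')          # label absent, no default given
fallback = ser.get('f', -1.0)   # label absent, default returned
present = ser.get('a', -1.0)    # label present, default ignored
print(missing, fallback, present)
```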

Vectorized operations and label alignment with Series
-------------------------
As with raw NumPy arrays, looping through a Series value by value is usually unnecessary in data analysis. A Series can also be passed to most NumPy functions that expect an ndarray.


In [None]:
s+s

In [None]:
s * 2

In [None]:
np.exp(s)
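
The heading also mentions label alignment: operations between two Series match values by label rather than by position, and labels present in only one operand produce NaN. A minimal sketch with fixed values (the names `ser` and `total` are illustrative):

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# The operands cover labels b..d and a..c; the result covers their union.
total = ser.iloc[1:] + ser.iloc[:-1]
print(total)
```

Only 'b' and 'c' appear in both operands, so only those two labels get a numeric result.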

In [None]:
s = pd.Series(np.random.randn(5), name='something')

In [None]:
s.name

# DataFrame

From dict of Series or dicts
--------------------------
The result index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the dict keys (in insertion order on recent pandas versions; older versions sorted them).



In [None]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}


In [None]:
d

In [None]:
df = pd.DataFrame(d)
df

In [None]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

The row and column labels can be accessed through the index and columns attributes, respectively:


In [None]:
df.columns


In [None]:
df['one']

In [None]:
df['one'].iloc[2]

From dict of ndarrays / lists
--------------------------------
The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be range(n), where n is the array length.


In [None]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}

In [None]:
pd.DataFrame(d)

In [None]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

## From a list of dicts

In [None]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [None]:
pd.DataFrame(data2)

In [None]:
pd.DataFrame(data2, index=['first', 'second'])

In [None]:
pd.DataFrame(data2, columns=['a', 'b'])

From a dict of tuples
-------------------------
You can automatically create a multi-indexed frame by passing a dict whose keys are tuples:


In [None]:
pd.DataFrame(
             {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}
              }
             )
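
A smaller version of the same idea, as a sketch with made-up keys, shows that both the rows and the columns come out as a MultiIndex and that a cell is addressed by a pair of tuples:

```python
import pandas as pd

# Outer keys become column tuples, inner keys become row tuples.
frame = pd.DataFrame({
    ('a', 'a'): {('A', 'B'): 4, ('A', 'C'): 3},
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
})

print(frame)
print(frame.columns.nlevels, frame.index.nlevels)

# Row tuple first, column tuple second.
cell = frame.loc[('A', 'B'), ('a', 'b')]
print(cell)
```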

## Column selection, addition, deletion

In [None]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [None]:
df = pd.DataFrame(d)

In [None]:
df

In [None]:
df['one']

In [None]:
df['three'] = df['one'] * df['two']

In [None]:
df['flag'] = df['one'] > 2

## Columns can be deleted or popped like with a dict:

In [None]:
del df['two']

In [None]:
three = df.pop('three')

In [None]:
df

Inserting scalar value
-----------------------
When inserting a scalar value, it will naturally be propagated to fill the column:

In [None]:
df['foo'] = 'bar'

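Inserting columns in a DataFrame
--------------
By default, new columns are added at the end. df.insert(loc, column, value) inserts a column at a particular position instead: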

In [None]:
df.insert(1, 'bar', df['one'])

In [None]:
df

Viewing Data
--------------

In [None]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [None]:
df.head(10)

In [None]:
df.tail(10)

Display the index, columns, and underlying NumPy data
------------------------


In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

describe() shows a quick statistical summary of your data
---------------


In [None]:
df.describe()

Transposing your data
-------------------

In [None]:
df.T

Sorting by an axis
------



In [None]:
df

In [None]:
df.sort_index(axis=1, ascending=False)

Sorting by values
-------


In [None]:
df.sort_values(by='B')

# Selection 

Selecting a single column, which yields a Series, equivalent to df.A
--

In [None]:
df['A']

In [None]:
df[0:3]

In [None]:
df['20130102':'20130104']

Selection by Label
-----


In [None]:
df.loc[dates[0]]

In [None]:
df.loc[:,['A','B']]

Showing label slicing, both endpoints are included
-------

In [None]:
df.loc['20130102':'20130104',['A','B']]

Reduction in the dimensions of the returned object

In [None]:
df.loc['20130102',['A','B']]

For getting a scalar value
---


In [None]:
df.loc[dates[0],'A']

For getting fast access to a scalar (equiv to the prior method)
------

In [None]:
df.at[dates[0],'A']

Selection by Position
----- 

In [None]:
df.iloc[3]

In [None]:
df.iloc[3:5,0:2]
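
One difference between the two selection styles is easy to miss: label slices with loc include both endpoints, while positional slices with iloc exclude the stop position, as in ordinary Python slicing. A sketch with a deterministic frame (the names `frame`, `by_label`, and `by_position` are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: 2013-01-02, 2013-01-03 and 2013-01-04 are all returned.
by_label = frame.loc['20130102':'20130104']

# Positional slice: rows 1 and 2 only; position 3 is excluded.
by_position = frame.iloc[1:3]

print(len(by_label), len(by_position))
```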

Boolean Indexing
-----------

Using a single column’s values to select data.


In [None]:
df[df.A > 0]

In [None]:
df[df > 0]
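
The two forms above behave differently: a boolean Series selects whole rows, while a boolean DataFrame keeps the original shape and puts NaN wherever the condition is False. A sketch with fixed values (the names `frame`, `rows`, and `masked` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

# Boolean Series: keeps only the rows where column A is positive.
rows = frame[frame['A'] > 0]

# Boolean DataFrame: same shape, NaN where the value is not positive.
masked = frame[frame > 0]

print(rows)
print(masked)
```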

Using the isin() method for filtering:

In [None]:
df2 = df.copy()

In [None]:
df2['E'] = ['one', 'one','two','three','four','three']

In [None]:
df2

In [None]:
df2[df2['E'].isin(['two','four'])]

Missing Data
------

In [None]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [None]:
df1

In [None]:
df1.loc[dates[0]:dates[1],'E'] = 1


In [None]:
df1

To drop any rows that have missing data.
------

In [None]:
df1.dropna(how='any')

Filling missing data


In [None]:
df1.fillna(value=5)

To get the boolean mask where values are nan


In [None]:
pd.isnull(df1)
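
Note that dropna and fillna return new objects and leave the original frame untouched. A sketch with a small frame containing two missing values (all names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

dropped = frame.dropna(how='any')   # keeps only rows with no NaN at all
filled = frame.fillna(value=5)      # replaces every NaN with 5
mask = frame.isna()                 # True where a value is missing

print(dropped)
print(filled)
print(mask)
```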

# Stats

Performing a descriptive statistic

In [None]:
df.mean()

Same operation on the other axis

In [None]:
df.mean(axis=1)

In [None]:
df

Applying functions to the data
-----------

In [None]:
df.apply(lambda x: x + 1 )
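
By default apply passes each column to the function as a Series; with axis=1 it passes each row instead. When the function returns a scalar, the result is a Series. A sketch with fixed values (the names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 40.0]})

# One value per column: the spread between max and min.
col_range = frame.apply(lambda col: col.max() - col.min())

# One value per row: the sum across columns.
row_sum = frame.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_sum)
```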

Merge
-----

Concat


In [None]:
df = pd.DataFrame(np.random.randn(10, 4))


In [None]:
pieces = [df[:3], df[3:7], df[7:]]

In [None]:
pd.concat(pieces)

Join
------

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [None]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [None]:
pd.merge(left, right, on='key')

In [None]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [None]:
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [None]:
pd.merge(left, right, on='key')
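
The two pairs of frames in this section differ in how the keys repeat. With duplicate keys on both sides, merge produces every combination of matching rows; with unique keys it produces one row per key. A sketch reusing the same values under illustrative names:

```python
import pandas as pd

# Duplicate keys on both sides: 2 x 2 = 4 result rows.
dup_left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
dup_right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
many = pd.merge(dup_left, dup_right, on='key')

# Unique keys on both sides: one result row per key.
uniq_left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
uniq_right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
one = pd.merge(uniq_left, uniq_right, on='key')

print(many)
print(one)
```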

Append
--------

In [None]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
pd.concat([df, s.to_frame().T], ignore_index=True)

Grouping
------------

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
   

Grouping and then applying the sum() function to the resulting groups.


In [None]:
# Select the numeric columns so the string column B is not concatenated.
df.groupby('A')[['C', 'D']].sum()

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.

In [None]:
df.groupby(['A','B']).sum()
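
Because the frame above is filled with random numbers, the sums are hard to verify by eye. The same two groupings on a small deterministic frame (made-up values, illustrative names) look like this:

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'one'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

# One key: the result is indexed by the distinct values of A.
by_a = frame.groupby('A')['C'].sum()

# Two keys: the result is indexed by a MultiIndex of (A, B) pairs.
by_ab = frame.groupby(['A', 'B'])['C'].sum()

print(by_a)
print(by_ab)
```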

stack
------
The stack() method “compresses” a level in the DataFrame’s columns.


In [None]:
stacked = df.stack()

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [None]:
stacked.unstack()
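
On a small frame the round trip is easier to follow. In this sketch (illustrative names, fixed values) stack moves the column labels into an inner index level, and unstack moves them back:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])

# Series indexed by (row label, column label) pairs.
stacked_small = frame.stack()

# unstack() pivots the last index level back into columns.
restored = stacked_small.unstack()

print(stacked_small)
print(restored)
```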

In [None]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
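
pivot_table is easiest to read when the columns argument is a categorical column rather than a column of random floats. A sketch on a made-up frame, relying on the default aggregation (mean):

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                      'B': ['one', 'two', 'one', 'two'],
                      'D': [1.0, 2.0, 3.0, 4.0]})

# Rows come from A, columns from B, cells are the mean of D per (A, B) pair.
table = pd.pivot_table(frame, values='D', index='A', columns='B')
print(table)
```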