# Agenda: Pandas

***

## Contents:
- Basics of Pandas
- Advanced topics in Pandas
- Exercises in Pandas

In [8]:
import numpy as np
import pandas as pd

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, moving window linear regressions,
    date shifting and lagging, etc.
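
The automatic alignment mentioned in the list above can be illustrated with a minimal sketch. The Series names `s1` and `s2` and their values are made up for this example; they are not part of the class data.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition matches values by label, not by position.
# Labels present in only one operand ("a" and "d") produce NaN.
total = s1 + s2
print(total)
```

Because missing matches become NaN, the result is a float Series even though both inputs hold integers.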

Let's introduce the three fundamental Pandas data structures: the Series, DataFrame, and Index.


## The Pandas Series Object
Type `pd.` and press Tab to list everything the package offers.
Append `?` to any name to display its documentation, for example:

    pd.Series?

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).
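
A small sketch of two points from the description above: labels may repeat, and statistical methods skip missing data. The Series `s` and its values are illustrative assumptions, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Duplicate labels are allowed; "x" appears twice here (illustrative data).
s = pd.Series([1.0, 2.0, np.nan, 4.0], index=["x", "y", "z", "x"])

# Selecting a repeated label returns every matching entry as a Series.
dupes = s.loc["x"]

# Statistical methods skip NaN by default (skipna=True).
mean_skipping_nan = s.mean()          # (1 + 2 + 4) / 3
mean_with_nan = s.mean(skipna=False)  # NaN, because one value is missing

print(dupes)
print(mean_skipping_nan, mean_with_nan)
```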



In [15]:
# Let's create a pandas Series

data=pd.Series([1,4,5,6,6,7])
data

0    1
1    4
2    5
3    6
4    6
5    7
dtype: int64

In [16]:
data.values

array([1, 4, 5, 6, 6, 7])

In [17]:
data.index

RangeIndex(start=0, stop=6, step=1)

In [19]:
# Extract a value from a Series by its index label
data[1]

4

In [28]:
data=pd.Series([1,4,5,68.3,4.0,None],index=['a','b','c','d','r','y'])
data

a     1.0
b     4.0
c     5.0
d    68.3
r     4.0
y     NaN
dtype: float64

In [27]:
data['a']

1.0

In [35]:

data=pd.Series([1,4,5,68.3,4.0,None],index=[1,'b','c','d',1,'y'])
data

1     1.0
b     4.0
c     5.0
d    68.3
1     4.0
y     NaN
dtype: float64

In [30]:
data['y']*10

nan

In [32]:
data['b']*10

40.0

In [36]:
pop={'Mumbai':200,'Bangalore':100,'Delhi':200,'Bhubaneswar':50}

In [37]:
pop

{'Mumbai': 200, 'Bangalore': 100, 'Delhi': 200, 'Bhubaneswar': 50}

In [38]:
population=pd.Series(pop)

In [39]:
population

Mumbai         200
Bangalore      100
Delhi          200
Bhubaneswar     50
dtype: int64

In [40]:
population['Mumbai']

200

In [41]:
#Slice
population['Mumbai':'Delhi']

Mumbai       200
Bangalore    100
Delhi        200
dtype: int64

In [44]:
pd.Series({2:'a',3:'b',4:'c'},index= [4,3])

4    c
3    b
dtype: object
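
When a dict is passed together with an explicit `index`, only the listed keys are kept, in the order given. The sketch below extends the cell above with a key that is absent from the dict; the name `letters` and the key `5` are assumptions added for illustration.

```python
import pandas as pd

letters = {2: "a", 3: "b", 4: "c"}

# Only keys listed in `index` are kept, in the order given.
# Key 5 is not in the dict, so its value is filled with NaN.
subset = pd.Series(letters, index=[3, 5])
print(subset)
```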

## The Pandas DataFrame Object

In [45]:
pop={'Mumbai':200,'Bangalore':100,'Delhi':200,'Bhubaneswar':50}
population=pd.Series(pop)

In [50]:
ar={'Mumbai':20,'Bangalore':12,'Delhi':2,'Bhubaneswar':20}
area=pd.Series(ar)

In [47]:
population

Mumbai         200
Bangalore      100
Delhi          200
Bhubaneswar     50
dtype: int64

In [48]:
area

Mumbai         20
Bangalore      12
Delhi           2
Bhubaneswar    20
dtype: int64

In [51]:
census=pd.DataFrame({'population':population,'area':area})

In [52]:
census

             population  area
Mumbai              200    20
Bangalore           100    12
Delhi               200     2
Bhubaneswar          50    20


In [53]:
type(census)

pandas.core.frame.DataFrame

In [54]:
census.index

Index(['Mumbai', 'Bangalore', 'Delhi', 'Bhubaneswar'], dtype='object')

In [55]:
census.columns

Index(['population', 'area'], dtype='object')
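
Columns can also be added after the DataFrame is built. The following sketch rebuilds `census` from the cells above and derives one extra column; the column name `density` is an assumption introduced here, not something the notebook defines.

```python
import pandas as pd

population = pd.Series({"Mumbai": 200, "Bangalore": 100, "Delhi": 200, "Bhubaneswar": 50})
area = pd.Series({"Mumbai": 20, "Bangalore": 12, "Delhi": 2, "Bhubaneswar": 20})
census = pd.DataFrame({"population": population, "area": area})

# Assigning to a new column name inserts that column.
# The division is element-wise and aligned on the city labels.
census["density"] = census["population"] / census["area"]

# Label of the row with the largest density.
densest_city = census["density"].idxmax()
print(census)
print(densest_city)
```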

In [60]:
# Find the maximum area across all cities
census['area'].max()


20

In [61]:
type(census['area'])

pandas.core.series.Series

In [63]:
pd.DataFrame([{'a':1,'b':2},{'b':3,'c':6}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  6.0


In [64]:
pd.DataFrame(np.random.rand(3,2),columns=['length','breadth'],index=[1,2,3])

     length   breadth
1  0.347757  0.261223
2  0.915517  0.175943
3  0.700797  0.908246


In [65]:
# Features of an Index
inda=pd.Index([1,2,3,5,8,10])
indb=pd.Index([3,5,6,2,8])

In [66]:
type(inda)

pandas.core.indexes.numeric.Int64Index

In [70]:
# Intersection (the & operator no longer performs set operations in pandas 2.0+)
inda.intersection(indb)

Int64Index([2, 3, 5, 8], dtype='int64')

In [71]:
# Union (the | operator no longer performs set operations in pandas 2.0+)
inda.union(indb)

Int64Index([1, 2, 3, 5, 6, 8, 10], dtype='int64')

In [69]:
# Symmetric difference (the ^ operator no longer performs set operations in pandas 2.0+)
inda.symmetric_difference(indb)

Int64Index([1, 6, 10], dtype='int64')

## Implicit & Explicit slicing

In [72]:
data=pd.Series([0.25,0.5,0.75,3.45],index=['a','b','c','d'])

In [73]:
data

a    0.25
b    0.50
c    0.75
d    3.45
dtype: float64

In [76]:
# Explicit (label-based) slicing: the end label is included
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [77]:
# Implicit (position-based) slicing: the end position is excluded
data[0:2]

a    0.25
b    0.50
dtype: float64
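
The two slices above differ in whether the endpoint is included, and the difference becomes confusing when the labels are themselves integers. A minimal sketch of that case, using the `.loc` (label) and `.iloc` (position) accessors to make the choice explicit; the Series `s` is illustrative data, not part of the notebook.

```python
import pandas as pd

# Integer labels that deliberately differ from positions (illustrative data).
s = pd.Series(["a", "b", "c"], index=[1, 3, 5])

# Label-based slice: labels 1 through 3, end label included.
loc_slice = s.loc[1:3]

# Position-based slice: positions 1 and 2, end position excluded.
iloc_slice = s.iloc[1:3]

print(loc_slice)
print(iloc_slice)
```

The same `1:3` selects different rows depending on the accessor, which is why spelling out `.loc` or `.iloc` is usually safer than bare brackets when labels are integers.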

In [79]:
# Masking: keep only the entries where the boolean condition is True
data[(data>0.4) & (data<1)]

b    0.50
c    0.75
dtype: float64
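
It can help to look at the mask on its own before using it to select. This sketch recreates `data` from the cells above; the names `mask` and `selected` are introduced here for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 3.45], index=["a", "b", "c", "d"])

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (data > 0.4) & (data < 1)

# Indexing with the mask keeps only the entries where it is True.
selected = data[mask]

print(mask)
print(selected)
```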

In [81]:
# Fancy indexing: pass a list of labels
data[['a','d']]

a    0.25
d    3.45
dtype: float64

Next class:
- `loc` and `iloc`
- Operations using DataFrames
    - Merging different DataFrames
    - Handling missing values
    - Replacing missing values
    - Importing external data, such as CSV files, as DataFrames
    - Aggregation and pivot tables
    - Vectorizing
    - Time series
    - Visualization