# Agenda : Pandas

***

## Contents:
- Basics of Pandas
- Advanced topics in Pandas
- Exercises in Pandas

In [1]:
import numpy as np
import pandas as pd

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, moving window linear regressions,
    date shifting and lagging, etc.

Let's introduce the three fundamental Pandas data structures: the Series, DataFrame, and Index.


## The Pandas Series Object
Type `pd.` and press Tab to list the available attributes.
Append `?` to any object to show its documentation, for example:
    pd.Series?

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).
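
A minimal sketch of the two properties described above, using made-up values (the Series `s` and its labels are illustrative, not part of the notes): labels may repeat, and statistical methods skip NaN unless told otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the label 'a' is deliberately repeated.
s = pd.Series([1.0, 2.0, np.nan, 4.0], index=['a', 'b', 'c', 'a'])

# Selecting a repeated label returns every matching row as a Series.
dup = s['a']

# Statistical methods skip NaN by default ...
mean_skip = s.mean()
# ... unless skipna=False is passed, in which case NaN propagates.
mean_nan = s.mean(skipna=False)

print(dup)
print(mean_skip, mean_nan)
```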



In [15]:
# Let's create a pandas Series

data=pd.Series([1,4,5,6,6,7])
data

0    1
1    4
2    5
3    6
4    6
5    7
dtype: int64

In [16]:
data.values

array([1, 4, 5, 6, 6, 7])

In [17]:
data.index

RangeIndex(start=0, stop=6, step=1)

In [19]:
#Extract the values of a series using index
data[1]

4

In [28]:
data=pd.Series([1,4,5,68.3,4.0,None],index=['a','b','c','d','r','y'])
data

a     1.0
b     4.0
c     5.0
d    68.3
r     4.0
y     NaN
dtype: float64

In [27]:
data['a']

1.0

In [35]:

data=pd.Series([1,4,5,68.3,4.0,None],index=[1,'b','c','d',1,'y'])
data

1     1.0
b     4.0
c     5.0
d    68.3
1     4.0
y     NaN
dtype: float64

In [30]:
data['y']*10

nan

In [32]:
data['b']*10

40.0

In [36]:
pop={'Mumbai':200,'Bangalore':100,'Delhi':200,'Bhubaneswar':50}

In [37]:
pop

{'Mumbai': 200, 'Bangalore': 100, 'Delhi': 200, 'Bhubaneswar': 50}

In [38]:
population=pd.Series(pop)

In [39]:
population

Mumbai         200
Bangalore      100
Delhi          200
Bhubaneswar     50
dtype: int64

In [40]:
population['Mumbai']

200

In [41]:
#Slice
population['Mumbai':'Delhi']

Mumbai       200
Bangalore    100
Delhi        200
dtype: int64

In [44]:
pd.Series({2:'a',3:'b',4:'c'},index= [4,3])

4    c
3    b
dtype: object
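
When a dict is combined with an explicit `index`, only the listed labels are kept, and a listed label that is missing from the dict becomes NaN. A small sketch, assuming a made-up mapping `letters`:

```python
import pandas as pd

# Hypothetical mapping; label 9 is deliberately absent from it.
letters = {2: 'a', 3: 'b', 4: 'c'}

# Keep label 4 and ask for a label (9) that the dict does not contain.
picked = pd.Series(letters, index=[4, 9])
print(picked)
```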

## The Pandas DataFrame Object

In [15]:
pop={'Mumbai':200,'Bangalore':100,'Delhi':200,'Bhubaneswar':50}
population=pd.Series(pop)

In [16]:
ar={'Mumbai':20,'Bangalore':12,'Delhi':2,'Bhubaneswar':20}
area=pd.Series(ar)

In [17]:
population

Mumbai         200
Bangalore      100
Delhi          200
Bhubaneswar     50
dtype: int64

In [18]:
area

Mumbai         20
Bangalore      12
Delhi           2
Bhubaneswar    20
dtype: int64

In [51]:
census=pd.DataFrame({'population':population,'area':area})

In [52]:
census

             population  area
Mumbai              200    20
Bangalore           100    12
Delhi               200     2
Bhubaneswar          50    20


In [53]:
type(census)

pandas.core.frame.DataFrame

In [54]:
census.index

Index(['Mumbai', 'Bangalore', 'Delhi', 'Bhubaneswar'], dtype='object')

In [55]:
census.columns

Index(['population', 'area'], dtype='object')

In [60]:
# Find the largest area among all cities
census['area'].max()


20

In [61]:
# A single column of a DataFrame is a Series
type(census['area'])

pandas.core.series.Series

In [63]:
pd.DataFrame([{'a':1,'b':2},{'b':3,'c':6}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  6.0


In [64]:
pd.DataFrame(np.random.rand(3,2),columns=['length','breadth'],index=[1,2,3])

     length   breadth
1  0.347757  0.261223
2  0.915517  0.175943
3  0.700797  0.908246


In [65]:
# Features of an Index
inda=pd.Index([1,2,3,5,8,10])
indb=pd.Index([3,5,6,2,8])

In [66]:
type(inda)

pandas.core.indexes.numeric.Int64Index

In [70]:
# Intersection (the & operator stopped acting as a set operation on Index in pandas 2.0)
inda.intersection(indb)

Int64Index([2, 3, 5, 8], dtype='int64')

In [71]:
# Union (use the method; | is no longer a set operation on Index in pandas 2.0)
inda.union(indb)

Int64Index([1, 2, 3, 5, 6, 8, 10], dtype='int64')

In [69]:
# Symmetric difference (use the method; ^ is no longer a set operation on Index in pandas 2.0)
inda.symmetric_difference(indb)

Int64Index([1, 6, 10], dtype='int64')

## Implicit & Explicit slicing

In [2]:
data=pd.Series([0.25,0.5,0.75,3.45],index=['a','b','c','d'])

In [3]:
data

a    0.25
b    0.50
c    0.75
d    3.45
dtype: float64

In [4]:
# Explicit slicing
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [5]:
#Implicit slicing
data[0:2]

a    0.25
b    0.50
dtype: float64

In [6]:
#masking
data[(data>0.4) & (data<1)] 

b    0.50
c    0.75
dtype: float64

In [81]:
#fancy indexing
data[['a','d']]

a    0.25
d    3.45
dtype: float64

In [10]:
data.loc['a':"c"]

a    0.25
b    0.50
c    0.75
dtype: float64

In [14]:
data.iloc[0:2]

a    0.25
b    0.50
dtype: float64
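
The difference between `loc` (explicit, label-based) and `iloc` (implicit, position-based) matters most when the labels are themselves integers. A short sketch, assuming a made-up Series `s` whose labels do not start at 0:

```python
import pandas as pd

# Hypothetical Series with integer labels 1, 3, 5.
s = pd.Series(['x', 'y', 'z'], index=[1, 3, 5])

by_label = s.loc[1]       # the row labelled 1
by_position = s.iloc[1]   # the second row, whatever its label

label_slice = s.loc[1:3]  # label slice: the end label 3 is included
pos_slice = s.iloc[1:3]   # positional slice: the end position 3 is excluded

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Using `loc` and `iloc` everywhere, rather than plain `[]`, keeps the intent visible to the reader.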

Next class:
- loc and iloc
- Operations on DataFrames
    - Merging different DataFrames
    - Handling missing values
    - Replacing missing values
    - Importing external data such as CSV files into a DataFrame
    - Aggregation & pivot tables
    - Vectorizing
    - Time series
    - Visualization

In [28]:
data=pd.DataFrame({'area':area,"population":population})

In [29]:
data

             area  population
Mumbai         20         200
Bangalore      12         100
Delhi           2         200
Bhubaneswar    20          50


In [30]:
data["area"]

Mumbai         20
Bangalore      12
Delhi           2
Bhubaneswar    20
Name: area, dtype: int64

In [37]:
data["population"]

Mumbai         200
Bangalore      100
Delhi          200
Bhubaneswar     50
Name: population, dtype: int64

In [38]:
data["density"]=data["population"]/data["area"]

In [33]:
data

             area  population
Mumbai         20         200
Bangalore      12         100
Delhi           2         200
Bhubaneswar    20          50


In [39]:
data.density

Mumbai          10.000000
Bangalore        8.333333
Delhi          100.000000
Bhubaneswar      2.500000
Name: density, dtype: float64

In [42]:
data['area_pop']= data.population*data.area

In [43]:
data

             area  population     density  area_pop
Mumbai         20         200   10.0           4000
Bangalore      12         100    8.333333      1200
Delhi           2         200  100.0            400
Bhubaneswar    20          50    2.5           1000


In [44]:
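# Attribute-style assignment works here only because the column already exists;
# data['area_pop'] = ... is the safer form for creating columns.
data.area_pop= data.population*data.area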

In [45]:
data.values

array([[2.00000000e+01, 2.00000000e+02, 1.00000000e+01, 4.00000000e+03],
       [1.20000000e+01, 1.00000000e+02, 8.33333333e+00, 1.20000000e+03],
       [2.00000000e+00, 2.00000000e+02, 1.00000000e+02, 4.00000000e+02],
       [2.00000000e+01, 5.00000000e+01, 2.50000000e+00, 1.00000000e+03]])

In [47]:
# Transposing the data sets
data.T

            Mumbai  Bangalore  Delhi  Bhubaneswar
area          20.0  12.0         2.0         20.0
population   200.0  100.0      200.0         50.0
density       10.0  8.333333   100.0          2.5
area_pop    4000.0  1200.0     400.0       1000.0


In [48]:
data.loc[data.density>10]

       area  population  density  area_pop
Delhi     2         200    100.0       400


In [49]:
data.loc[data.area>10,['population','density']]

             population    density
Mumbai              200  10.0
Bangalore           100   8.333333
Bhubaneswar          50   2.5


In [51]:
#replace the values in a data set
data.iloc[0,3]

4000

In [52]:
data.iloc[0,3]=1000

In [53]:
data

             area  population     density  area_pop
Mumbai         20         200   10.0           1000
Bangalore      12         100    8.333333      1200
Delhi           2         200  100.0            400
Bhubaneswar    20          50    2.5           1000


In [79]:
pop={'Mumbai':200,'Bangalore':100,'Delhi':200,'Bhubaneswar':50,'Pune':10,'Kochi':20}
population=pd.Series(pop)

In [80]:
ar={'Mumbai':20,'Bangalore':12,'Delhi':2,'Bhubaneswar':20,'Kanpur':30}
area=pd.Series(ar)

In [81]:
data=pd.DataFrame({'area':area,"population":population})

In [82]:
data


             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0         NaN
Kochi         NaN        20.0
Mumbai       20.0       200.0
Pune          NaN        10.0


In [61]:
# Index alignment between two pandas Series
population/area

Bangalore        8.333333
Bhubaneswar      2.500000
Delhi          100.000000
Kanpur                NaN
Kochi                 NaN
Mumbai          10.000000
Pune                  NaN
dtype: float64

In [62]:
population+area

Bangalore      112.0
Bhubaneswar     70.0
Delhi          202.0
Kanpur           NaN
Kochi            NaN
Mumbai         220.0
Pune             NaN
dtype: float64

In [63]:
population.add(area,fill_value=0)

Bangalore      112.0
Bhubaneswar     70.0
Delhi          202.0
Kanpur          30.0
Kochi           20.0
Mumbai         220.0
Pune            10.0
dtype: float64

In [69]:
fill=area.mean()
fill

16.8

In [70]:
area.add(population,fill_value=fill)

Bangalore      112.0
Bhubaneswar     70.0
Delhi          202.0
Kanpur          46.8
Kochi           36.8
Mumbai         220.0
Pune            26.8
dtype: float64
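
A small sketch of what `fill_value` does and does not do, assuming two made-up Series `a` and `b`: it stands in for a value that is missing on one side only, and the result stays NaN where both sides are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: 'x' exists only in a, 'z' only in b, 'y' is NaN in both.
a = pd.Series({'x': 1.0, 'y': np.nan})
b = pd.Series({'y': np.nan, 'z': 3.0})

plain = a + b                    # every label has a missing side, so all NaN
filled = a.add(b, fill_value=0)  # one-sided gaps are treated as 0

print(plain)
print(filled)
```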

# Operating on null values
- isnull() : returns a boolean mask (True where a value is null)
- notnull() : the opposite of isnull()
- dropna() : drops the rows or columns that contain null values
- fillna() : imputation (replaces null values)

In [73]:
data=pd.Series([1,3,None,'cb',45,645,54,None])

In [74]:
data


0       1
1       3
2    None
3      cb
4      45
5     645
6      54
7    None
dtype: object

In [75]:
data.isnull()

0    False
1    False
2     True
3    False
4    False
5    False
6    False
7     True
dtype: bool

In [76]:
data.notnull()

0     True
1     True
2    False
3     True
4     True
5     True
6     True
7    False
dtype: bool

In [77]:
data[data.notnull()]

0      1
1      3
3     cb
4     45
5    645
6     54
dtype: object

In [78]:
data.dropna()

0      1
1      3
3     cb
4     45
5    645
6     54
dtype: object

In [84]:
data

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0         NaN
Kochi         NaN        20.0
Mumbai       20.0       200.0
Pune          NaN        10.0


In [85]:
data.isnull()

              area  population
Bangalore    False       False
Bhubaneswar  False       False
Delhi        False       False
Kanpur       False        True
Kochi         True       False
Mumbai       False       False
Pune          True       False


In [86]:
data.notnull()

              area  population
Bangalore     True        True
Bhubaneswar   True        True
Delhi         True        True
Kanpur        True       False
Kochi        False        True
Mumbai        True        True
Pune         False        True


In [93]:
data

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0         NaN
Kochi         NaN        20.0
Mumbai       20.0       200.0
Pune          NaN        10.0


In [106]:

data.iloc[3,1]=np.nan
data

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0         NaN
Kochi         NaN        20.0
Mumbai       20.0       200.0
Pune          NaN        10.0


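In [105]:
# Note the execution count: this cell ran while Kanpur's population held 200.0,
# so only the `area` column contained nulls and was dropped.
data.dropna(axis='columns')

             population
Bangalore         100.0
Bhubaneswar        50.0
Delhi             200.0
Kanpur            200.0
Kochi              20.0
Mumbai            200.0
Pune               10.0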
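
Dropping every row or column that has any null can discard a lot of data. A hedged sketch of the finer controls `dropna` offers (`how`, `subset`, `thresh`), using a frame `df` rebuilt here from the census values so the block runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuilt copy of the census table; the name df is only for this example.
df = pd.DataFrame(
    {'area': [12.0, 20.0, 2.0, 30.0, np.nan, 20.0, np.nan],
     'population': [100.0, 50.0, 200.0, np.nan, 20.0, 200.0, 10.0]},
    index=['Bangalore', 'Bhubaneswar', 'Delhi', 'Kanpur',
           'Kochi', 'Mumbai', 'Pune'])

rows_complete = df.dropna()              # default: drop rows with any null
rows_not_empty = df.dropna(how='all')    # drop only rows where every value is null
area_known = df.dropna(subset=['area'])  # consider only the 'area' column
at_least_two = df.dropna(thresh=2)       # keep rows with at least 2 non-null values

print(rows_complete)
print(area_known)
```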


In [108]:
# Imputing the null values with zeros
data.fillna(0)

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0         0.0
Kochi         0.0        20.0
Mumbai       20.0       200.0
Pune          0.0        10.0


In [110]:
# Imputing the null values with a forward fill
# (data.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
data.ffill()

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0       200.0
Kochi        30.0        20.0
Mumbai       20.0       200.0
Pune         20.0        10.0


In [112]:
# Fill the null values with a constant (here 45); a mean or any other imputed value works the same way
data.fillna(45)

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0        45.0
Kochi        45.0        20.0
Mumbai       20.0       200.0
Pune         45.0        10.0


In [116]:
data

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0         NaN
Kochi         NaN        20.0
Mumbai       20.0       200.0
Pune          NaN        10.0


In [117]:
# Imputing the null values with a forward fill
# (data.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
data.ffill()

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0       200.0
Kochi        30.0        20.0
Mumbai       20.0       200.0
Pune         20.0        10.0


In [114]:
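# Backward fill: each null takes the next valid value below it
# (data.bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
data.bfill()

             area  population
Bangalore    12.0       100.0
Bhubaneswar  20.0        50.0
Delhi         2.0       200.0
Kanpur       30.0        20.0
Kochi        20.0        20.0
Mumbai       20.0       200.0
Pune          NaN        10.0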
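
Imputing with a per-column mean is a common alternative to a constant or a directional fill. A minimal sketch, assuming the same census values rebuilt as `df` so the block is self-contained; passing a Series to `fillna` fills each column with the entry that matches its name.

```python
import numpy as np
import pandas as pd

# Rebuilt copy of the census table; the name df is only for this example.
df = pd.DataFrame(
    {'area': [12.0, 20.0, 2.0, 30.0, np.nan, 20.0, np.nan],
     'population': [100.0, 50.0, 200.0, np.nan, 20.0, 200.0, 10.0]},
    index=['Bangalore', 'Bhubaneswar', 'Delhi', 'Kanpur',
           'Kochi', 'Mumbai', 'Pune'])

# df.mean() skips NaN and returns one mean per column.
col_means = df.mean()

# Each null is replaced by the mean of its own column.
imputed = df.fillna(col_means)
print(imputed)
```

Whether a mean is appropriate depends on the data; for ordered data such as time series, a forward or backward fill is often the more natural choice.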
