#### Pandas data structures

##### Series

A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data:

In [8]:
import pandas as pd

a=pd.Series([1,2,3,4,5])
a

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]:
# an explicit index can be supplied through the index argument

b=pd.Series([100,200,50],index=['India','China','us'])
b

India    100
China    200
us        50
dtype: int64

In [6]:
b['India':'us']

India    100
China    200
us        50
dtype: int64

In [9]:
b[b>100]

China    200
dtype: int64

In [10]:
b.index

Index(['India', 'China', 'us'], dtype='object')

In [11]:
b.values

array([100, 200,  50], dtype=int64)

Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:


In [14]:
pop={'India':1.5,'China':2,'Usa':1.2}
popn=pd.Series(pop)
popn

India    1.5
China    2.0
Usa      1.2
dtype: float64

In [16]:
popltn=pd.Series(pop,index=['Bangla','China','India'])  # 'Bangla' gets NaN because it is not a key in pop; 'Usa' is dropped because it is not in index
popltn

Bangla    NaN
China     2.0
India     1.5
dtype: float64

In [17]:
pd.isnull(popltn)

Bangla     True
China     False
India     False
dtype: bool

In [18]:
pd.notnull(popltn)

Bangla    False
China      True
India      True
dtype: bool

In [19]:
popltn.isnull()

Bangla     True
China     False
India     False
dtype: bool

In [20]:
popltn.notnull()

Bangla    False
China      True
India      True
dtype: bool

A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:

In [23]:
popn

India    1.5
China    2.0
Usa      1.2
dtype: float64

In [22]:
popltn

Bangla    NaN
China     2.0
India     1.5
dtype: float64

In [24]:
popn+popltn

Bangla    NaN
China     4.0
India     3.0
Usa       NaN
dtype: float64
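The NaN entries above appear because a label that exists in only one operand has no partner to add to. A minimal sketch, assuming you would rather treat a missing label as 0 (which may or may not suit your data), uses Series.add with fill_value. The names left and right are illustrative and not part of the notebook:

```python
import pandas as pd

left = pd.Series({'India': 1.5, 'China': 2.0, 'Usa': 1.2})
right = pd.Series({'China': 2.0, 'India': 1.5})

# The + operator aligns on labels; 'Usa' has no partner, so it becomes NaN.
plain = left + right

# add() with fill_value substitutes 0 for a label missing from one side.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

A label that is missing (or NaN) on both sides still produces NaN, even with fill_value.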

Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:

In [25]:
popn

India    1.5
China    2.0
Usa      1.2
dtype: float64

In [26]:
popn.index.name='Country'
popn.name='Population'


In [27]:
popn

Country
India    1.5
China    2.0
Usa      1.2
Name: Population, dtype: float64

A Series’s index can be altered in-place by assignment:


In [31]:
popn.index=['Bob','Harry','Jaggu']
popn

Bob      1.5
Harry    2.0
Jaggu    1.2
Name: Population, dtype: float64
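Index assignment only works when the new labels have the same length as the Series. The sketch below, on illustrative data, shows the error raised on a length mismatch and uses rename as an alternative that returns a new Series instead of modifying the original:

```python
import pandas as pd

s = pd.Series([1.5, 2.0, 1.2], index=['India', 'China', 'Usa'])

# Assigning a list of the wrong length is rejected with a ValueError.
try:
    s.index = ['Bob', 'Harry']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# rename() maps old labels to new ones and returns a new Series.
renamed = s.rename({'India': 'Bob', 'China': 'Harry', 'Usa': 'Jaggu'})

print(mismatch_rejected)
print(renamed)
```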

### DataFrame

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:

In [41]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

df=pd.DataFrame(data)
df

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [42]:
df.head(3)

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6


If you specify a sequence of columns, the DataFrame’s columns will be arranged in
that order:

In [43]:
df1=pd.DataFrame(data,columns=['year','state','pop'])
df1

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [45]:
df3=pd.DataFrame(data,columns=['state','province'])
df3

Unnamed: 0,state,province
0,Ohio,
1,Ohio,
2,Ohio,
3,Nevada,
4,Nevada,
5,Nevada,


In [46]:
df.columns

Index(['state', 'year', 'pop'], dtype='object')

In [48]:
df5=pd.DataFrame(data,index=['one','two','three','four','five','six'])
df5

Unnamed: 0,state,year,pop
one,Ohio,2000,1.5
two,Ohio,2001,1.7
three,Ohio,2002,3.6
four,Nevada,2001,2.4
five,Nevada,2002,2.9
six,Nevada,2003,3.2


In [51]:
df5.iloc[0]

state    Ohio
year     2000
pop       1.5
Name: one, dtype: object

In [52]:
df5.loc['one']

state    Ohio
year     2000
pop       1.5
Name: one, dtype: object

In [61]:
df5.loc['one']['state']

'Ohio'

Adding new columns:

In [64]:
import numpy as np
df5['debt']=np.arange(6.)
df5

Unnamed: 0,state,year,pop,debt
one,Ohio,2000,1.5,0.0
two,Ohio,2001,1.7,1.0
three,Ohio,2002,3.6,2.0
four,Nevada,2001,2.4,3.0
five,Nevada,2002,2.9,4.0
six,Nevada,2003,3.2,5.0


In [66]:
df3

Unnamed: 0,state,province
0,Ohio,
1,Ohio,
2,Ohio,
3,Nevada,
4,Nevada,
5,Nevada,


In [74]:
df3['eastern']=df3['state']=='Ohio'
df3

Unnamed: 0,state,province,eastern
0,Ohio,,True
1,Ohio,,True
2,Ohio,,True
3,Nevada,,False
4,Nevada,,False
5,Nevada,,False


Deleting columns:

In [75]:
del df3['eastern']
df3

Unnamed: 0,state,province
0,Ohio,
1,Ohio,
2,Ohio,
3,Nevada,
4,Nevada,
5,Nevada,


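In older pandas versions, the column returned from indexing a DataFrame can be a view on the
underlying data rather than a copy, so in-place modifications to the
Series may be reflected in the DataFrame. With Copy-on-Write (the default
from pandas 3.0), the returned Series behaves like a copy. In any version, the column can be
explicitly copied with the Series’s copy method.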
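Whatever the view or copy behaviour of the installed pandas version, an explicit copy is always independent of the DataFrame. A minimal sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# copy() returns an independent Series, so editing it cannot touch df.
pop_copy = df['pop'].copy()
pop_copy.iloc[0] = 99.0

print(df['pop'].tolist())   # the DataFrame column is unchanged
print(pop_copy.tolist())
```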

Another common form of data is a nested dict of dicts. If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys
as the columns and the inner keys as the row indices:

In [78]:
pop={'Noida':{'2000':24567,'2001':345678},'Mizorama':{'2000':23456,'2001':34567}}

pop1=pd.DataFrame(pop)
pop1

Unnamed: 0,Noida,Mizorama
2000,24567,23456
2001,345678,34567
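When the inner dicts do not share the same keys, the row index is the union of all inner keys and the gaps are filled with NaN. The sketch below uses made-up keys and numbers purely for illustration:

```python
import pandas as pd

# Hypothetical figures: 'A' has no entry for '2002', 'B' has none for '2000'.
nested = {'A': {'2000': 10, '2001': 20},
          'B': {'2001': 30, '2002': 40}}

frame = pd.DataFrame(nested)
print(frame)
```

Because NaN is a floating-point value, the columns are stored as float64 even though the inputs were integers.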


In [79]:
pop1.T

Unnamed: 0,2000,2001
Noida,24567,345678
Mizorama,23456,34567


In [81]:
pop1.index.name='year'
pop1.columns.name="State"
pop1

State,Noida,Mizorama
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,24567,23456
2001,345678,34567


In [82]:
a=pd.Series([1,2,3,4])
b=pd.Series([5,6,7,8])
c=pd.Series(['9','10','11'])

pd.DataFrame([a,b,c])

Unnamed: 0,0,1,2,3
0,1,2,3,4.0
1,5,6,7,8.0
2,9,10,11,


Unlike Python sets, a pandas Index can contain duplicate labels:

In [85]:

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

### Reindexing

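reindex is not an in-place operation: it returns a new object whose data conform to the new index, and labels that were not present get NaN.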

In [87]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [91]:
obj2=obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
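If NaN is not wanted for the new labels, reindex accepts a fill_value. The sketch below also checks that the original Series is left untouched; obj3 is an illustrative name:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# fill_value replaces the NaN that reindex would insert for new labels.
obj3 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

print(obj3)
print(obj.index.tolist())  # obj keeps its original labels and order
```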

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:

In [92]:
a=pd.Series(['orange','blue','yellow'],index=[0,2,4])
a

0    orange
2      blue
4    yellow
dtype: object

In [93]:
a.reindex(np.arange(6),method='ffill')

0    orange
1    orange
2      blue
3      blue
4    yellow
5    yellow
dtype: object

In [94]:
a

0    orange
2      blue
4    yellow
dtype: object

In [95]:
a.reindex(np.arange(6))

0    orange
1       NaN
2      blue
3       NaN
4    yellow
5       NaN
dtype: object
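For comparison with the forward fill shown earlier, here is a small sketch of the opposite direction, method='bfill', on the same kind of data. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

colors = pd.Series(['orange', 'blue', 'yellow'], index=[0, 2, 4])

# bfill takes each missing label's value from the next existing label;
# label 5 has no later label to borrow from, so it stays NaN.
back = colors.reindex(np.arange(6), method='bfill')
print(back)
```

Both fill methods require the original index to be sorted (monotonic).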

With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed only a sequence, it reindexes the rows in the result:

In [108]:
df=pd.DataFrame(np.arange(9,dtype='int64').reshape(3,3),index=['a','c','b'],columns=['kl','ka','tl'])
df

Unnamed: 0,kl,ka,tl
a,0,1,2
c,3,4,5
b,6,7,8


In [116]:
df.reindex(['a','b','c','e'])

Unnamed: 0,kl,ka,tl
a,0.0,1.0,2.0
b,6.0,7.0,8.0
c,3.0,4.0,5.0
e,,,


In [119]:
df.reindex(index=['a','b','c','e'],columns=['ka','kl','up'])

Unnamed: 0,ka,kl,up
a,1.0,0.0,
b,7.0,6.0,
c,4.0,3.0,
e,,,


### Dropping entries from an axis

In [2]:
import numpy as np
import pandas as pd
s=pd.Series(np.arange(5),index=['a','b','c','d','e'])
s

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [3]:
v=s.drop(['c'])
v

a    0
b    1
d    3
e    4
dtype: int32

In [4]:
s.drop(['a','b'])

c    2
d    3
e    4
dtype: int32

In [6]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [7]:
data.drop(['Ohio','Colorado'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [9]:
data.drop(['one','two'],axis=1)

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


In [12]:
data.drop(['one','two'],axis='columns')

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


#### Indexing, Selection, and Filtering

In [13]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [14]:
obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64

In [15]:
obj[0:2]

a    0.0
b    1.0
dtype: float64
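The two slices above can be written more explicitly with loc (labels, endpoint included) and iloc (positions, endpoint excluded), which removes any doubt about which rule applies. A minimal sketch:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['a':'c']    # label slice: 'c' is included
by_position = obj.iloc[0:2]    # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```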

In [16]:
obj[obj>2]

d    3.0
dtype: float64

In [18]:
obj['b':'c']=5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [17]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [19]:
data['one']

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32

In [22]:
data[['one','four']]

Unnamed: 0,one,four
Ohio,0,3
Colorado,4,7
Utah,8,11
New York,12,15


In [24]:
data.iloc[0]

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

In [26]:
data.iloc[0:3]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11


In [27]:
data['Ohio':'Utah']

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11


In [33]:
data.loc[:,'two':'four']

Unnamed: 0,two,three,four
Ohio,1,2,3
Colorado,5,6,7
Utah,9,10,11
New York,13,14,15


In [34]:
data.loc['Colorado',['two','four']]

two     5
four    7
Name: Colorado, dtype: int32

In [35]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [36]:
data.iloc[:,:3]

Unnamed: 0,one,two,three
Ohio,0,1,2
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14


In [37]:
data.iloc[:,:3][data.three>6]

Unnamed: 0,one,two,three
Utah,8,9,10
New York,12,13,14


In [44]:
data.loc['Ohio']

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

In [38]:
df=pd.DataFrame(np.arange(9).reshape(3,3),columns=['a','b','c'])
df

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


In [41]:
df.loc[0:2,'a':'c']

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


In [42]:
df[df['a']>3]

Unnamed: 0,a,b,c
2,6,7,8


In [49]:
df.loc[0,'a':'c']

a    0
b    1
c    2
Name: 0, dtype: int32

To access a single column by label, index the DataFrame directly with the column name (for example, df['a']).

To access a single row by label, use .loc (for example, df.loc[0]).
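A short sketch of the common access patterns on the same 3x3 frame; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

col_a = df['a']          # single column, selected by column label
row_0 = df.loc[0]        # single row, selected by row label
cell = df.loc[1, 'b']    # row label and column label together

print(col_a.tolist(), row_0.tolist(), cell)
```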

In [50]:
df[['a','c']]

Unnamed: 0,a,c
0,0,2
1,3,5
2,6,8


In [51]:
df

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


In [52]:
df.iloc[[0,2]]

Unnamed: 0,a,b,c
0,0,1,2
2,6,7,8


### Arithmetic and Data Alignment

In [53]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [54]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [55]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [56]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t
overlap. Missing values will then propagate in further arithmetic computations.
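To make the propagation concrete, the sketch below reuses s1 and s2, multiplies their sum, and then keeps only the labels that exist in both inputs with dropna:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2
doubled = total * 2             # NaN entries remain NaN
overlap_only = total.dropna()   # labels present in both inputs

print(doubled)
print(overlap_only)
```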

In [63]:
a=np.arange(9).reshape(3,3)
b=np.arange(3)
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [68]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('dcb'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [69]:
df1

Unnamed: 0,d,c,b
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [70]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [71]:
df1+df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,5.0,,4.0,
Oregon,,,,
Texas,11.0,,10.0,
Utah,,,,


Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear
as all missing in the result. The same holds for the rows whose labels are not common
to both objects.
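If only the overlapping part is of interest, one option (a sketch, not the only approach) is to restrict both frames to their common row and column labels before adding, using Index.intersection:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('dcb'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Labels found in both frames.
rows = df1.index.intersection(df2.index)
cols = df1.columns.intersection(df2.columns)

common = df1.loc[rows, cols] + df2.loc[rows, cols]
print(common)
```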

In [74]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
df2.loc[1,'b']=np.nan

In [75]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [76]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [77]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [78]:
df1.add(df2,fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [81]:
df1.rdiv(2)

Unnamed: 0,a,b,c,d
0,inf,2.0,1.0,0.666667
1,0.5,0.4,0.333333,0.285714
2,0.25,0.222222,0.2,0.181818


In [83]:
df1.div(2)

Unnamed: 0,a,b,c,d
0,0.0,0.5,1.0,1.5
1,2.0,2.5,3.0,3.5
2,4.0,4.5,5.0,5.5


In [91]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [93]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [92]:
df1.div(df2)

Unnamed: 0,a,b,c,d,e
0,,1.0,1.0,1.0,
1,0.8,,0.857143,0.875,
2,0.8,0.818182,0.833333,0.846154,
3,,,,,


In [94]:
df1.div(df2,fill_value=1)

Unnamed: 0,a,b,c,d,e
0,,1.0,1.0,1.0,0.25
1,0.8,5.0,0.857143,0.875,0.111111
2,0.8,0.818182,0.833333,0.846154,0.071429
3,0.066667,0.0625,0.058824,0.055556,0.052632


### Operations between DataFrame and Series

In [5]:
import pandas as pd
import numpy as np
a=pd.DataFrame(np.arange(12).reshape(4,-1),columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
a

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [6]:
srs=a.iloc[0]
srs

b    0
d    1
e    2
Name: Utah, dtype: int32

In [7]:
a+srs

Unnamed: 0,b,d,e
Utah,0,2,4
Ohio,3,5,7
Texas,6,8,10
Oregon,9,11,13


Now, what if the Series index and the DataFrame columns differ? The result is reindexed to the union of the labels:

In [10]:
srs=pd.Series(a.iloc[0],index=list('bef'))
a+srs

Unnamed: 0,b,d,e,f
Utah,0.0,,4.0,
Ohio,3.0,,7.0,
Texas,6.0,,10.0,
Oregon,9.0,,13.0,


If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods. For example:

In [12]:
a

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [13]:
sr1=a['b']
sr1

Utah      0
Ohio      3
Texas     6
Oregon    9
Name: b, dtype: int32

In [19]:
a.sub(sr1,axis=0)

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,0,1,2
Texas,0,1,2
Oregon,0,1,2


In [20]:
a-sr1

Unnamed: 0,Ohio,Oregon,Texas,Utah,b,d,e
Utah,,,,,,,
Ohio,,,,,,,
Texas,,,,,,,
Oregon,,,,,,,
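A typical use of matching on the rows is centering each row on its own mean. The sketch below assumes the same frame a and uses sub with axis=0; row_means and centered are illustrative names:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(4, -1), columns=list('bde'),
                 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = a.mean(axis=1)            # one value per row label
centered = a.sub(row_means, axis=0)   # align the Series on the row index

print(centered)
```

Writing a - row_means instead would try to match the row labels against the column labels, which gives the all-NaN result shown above.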


#### Function Application and Mapping

In [46]:
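# random data: the values change on every run, so outputs from cells executed in a different run show different numbers
frame=pd.DataFrame(np.random.randn(4,3) ,columns=list('abc'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])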
frame

Unnamed: 0,a,b,c
Utah,0.961067,0.406398,0.221388
Ohio,0.50036,-1.475588,0.344538
Texas,-0.884618,0.169209,-1.035132
Oregon,1.288274,-0.92244,-1.512648


In [23]:
np.abs(frame)

Unnamed: 0,a,b,c
Utah,2.041544,1.567412,0.443002
Ohio,0.594733,0.250342,1.694535
Texas,0.091908,0.671938,0.883535
Oregon,0.358227,0.038192,1.27865


Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s apply method does exactly this:

In [24]:
frame.apply(lambda x:max(x)-min(x))

a    2.636277
b    2.239350
c    1.251533
dtype: float64

Here the lambda function, which computes the difference between the maximum and minimum
of a Series, is invoked once on each column in frame. The result is a Series having
the columns of frame as its index.
If you pass axis='columns' (or, equivalently, axis=1) to apply, the function will be invoked once per row
instead:

In [26]:
frame.apply(lambda x:max(x)-min(x),axis=1)

Utah      2.484547
Ohio      1.944877
Texas     0.791627
Oregon    1.636877
dtype: float64
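For simple reductions such as max minus min, the built-in DataFrame methods give the same answer as apply without calling a Python function per column. A small sketch on fixed, made-up numbers so the result is reproducible:

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 0.5], [3.0, 0.0, -1.5]],
                     columns=list('abc'), index=['Utah', 'Ohio'])

spread_apply = frame.apply(lambda x: x.max() - x.min())   # per column
spread_direct = frame.max() - frame.min()                 # same result
row_spread = frame.max(axis=1) - frame.min(axis=1)        # per row

print(spread_apply.equals(spread_direct))
print(row_spread)
```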

In [31]:
frame

Unnamed: 0,a,b,c
Utah,2.041544,1.567412,-0.443002
Ohio,-0.594733,0.250342,-1.694535
Texas,-0.091908,-0.671938,-0.883535
Oregon,0.358227,-0.038192,-1.27865


In [30]:
def min_mean(x):
    return pd.Series([np.mean(x),min(x)],index=['mean','minimum'])
frame.apply(min_mean)

Unnamed: 0,a,b,c
mean,0.428283,0.276906,-1.07493
minimum,-0.594733,-0.671938,-1.694535


Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating-point value in frame. You can do this with applymap
(deprecated since pandas 2.1 in favor of DataFrame.map, which does the same thing):

In [34]:
def f(x): return "%.2f"%x
frame.applymap(f)

Unnamed: 0,a,b,c
Utah,2.04,1.57,-0.44
Ohio,-0.59,0.25,-1.69
Texas,-0.09,-0.67,-0.88
Oregon,0.36,-0.04,-1.28
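Since the element-wise method is named DataFrame.map in recent pandas releases and applymap in older ones, a version-tolerant sketch can pick whichever exists. The helper names fmt and elementwise are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({'a': [2.041544, -0.594733], 'b': [1.567412, 0.250342]},
                     index=['Utah', 'Ohio'])

def fmt(x):
    """Format one number with two decimal places."""
    return "%.2f" % x

# DataFrame.map exists from pandas 2.1; older releases only have applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(fmt)

print(formatted)
```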


The reason for the name applymap is that Series has a map method for applying an
element-wise function:

In [35]:
frame['a'].map(lambda x:"%.2f"%x)

Utah       2.04
Ohio      -0.59
Texas     -0.09
Oregon     0.36
Name: a, dtype: object

In [36]:
frame.iloc[0].map(lambda x:"%.2f"%x)

a     2.04
b     1.57
c    -0.44
Name: Utah, dtype: object

### Sorting and Ranking

In [39]:
srs=pd.Series(np.arange(4),index=['b','c','d','a'])
srs

b    0
c    1
d    2
a    3
dtype: int32

In [40]:
srs.sort_index()

a    3
b    0
c    1
d    2
dtype: int32

In [42]:
srs.sort_values()

b    0
c    1
d    2
a    3
dtype: int32

In [47]:
frame=frame.reindex(['b','c','a'],axis='columns')
frame

Unnamed: 0,b,c,a
Utah,0.406398,0.221388,0.961067
Ohio,-1.475588,0.344538,0.50036
Texas,0.169209,-1.035132,-0.884618
Oregon,-0.92244,-1.512648,1.288274


In [48]:
frame.sort_index()

Unnamed: 0,b,c,a
Ohio,-1.475588,0.344538,0.50036
Oregon,-0.92244,-1.512648,1.288274
Texas,0.169209,-1.035132,-0.884618
Utah,0.406398,0.221388,0.961067


In [49]:
frame.sort_index(axis='columns')

Unnamed: 0,a,b,c
Utah,0.961067,0.406398,0.221388
Ohio,0.50036,-1.475588,0.344538
Texas,-0.884618,0.169209,-1.035132
Oregon,1.288274,-0.92244,-1.512648


In [53]:
frame=pd.DataFrame({'b':[4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [56]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


The by argument accepts a single column name or, to sort by multiple columns, a list of names:


In [57]:
frame.sort_values(by='a')

Unnamed: 0,b,a
0,4,0
2,-3,0
1,7,1
3,2,1


In [59]:
frame=frame.reindex(['a','b'],axis=1)
frame

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [61]:
frame.sort_values(by=['a'])

Unnamed: 0,a,b
0,0,4
2,0,-3
1,1,7
3,1,2


In [65]:
frame.sort_values(by=['a','b'],ascending=False)

Unnamed: 0,a,b
1,1,7
3,1,2
0,0,4
2,0,-3
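The ascending argument can also be a list with one flag per sort column. The sketch below sorts a in ascending order and breaks ties by b in descending order:

```python
import pandas as pd

frame = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})

# One ascending flag per column listed in by.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```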


##### rank

In [75]:
a=pd.Series(np.random.randint(15,size=5))
a

0    10
1    10
2    13
3    11
4    12
dtype: int32

In [76]:
a.rank()

0    1.5
1    1.5
2    5.0
3    3.0
4    4.0
dtype: float64

In [78]:
a.rank(method='first')

0    1.0
1    2.0
2    5.0
3    3.0
4    4.0
dtype: float64

In [79]:
a.rank(method='max')

0    2.0
1    2.0
2    5.0
3    3.0
4    4.0
dtype: float64

In [80]:
a.rank(ascending=False,method='max')

0    5.0
1    5.0
2    1.0
3    3.0
4    2.0
dtype: float64
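The method argument only matters for tied values. The sketch below fixes the data to the values shown above (instead of drawing random numbers) and compares three tie-breaking methods side by side:

```python
import pandas as pd

a = pd.Series([10, 10, 13, 11, 12])

ranks = pd.DataFrame({
    'average': a.rank(),               # ties share the mean of their ranks
    'min': a.rank(method='min'),       # ties all get the lowest rank
    'dense': a.rank(method='dense'),   # like 'min', but ranks have no gaps
})
print(ranks)
```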

In [81]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],'c': [-2, 5, 8, -2.5]})
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


##### Axis Indexes with Duplicate Labels

In [82]:
srs=pd.Series(np.arange(4),index=['a','b','b','c'])
srs

a    0
b    1
b    2
c    3
dtype: int32

In [83]:
srs.unique()

array([0, 1, 2, 3])

In [84]:
srs.index.unique()

Index(['a', 'b', 'c'], dtype='object')

In [87]:
srs.index.is_unique

False

In [88]:
srs['b']

b    1
b    2
dtype: int32
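Duplicate labels make the return type of a lookup depend on the label: a repeated label returns a Series, while a unique one returns a scalar. A short sketch:

```python
import numpy as np
import pandas as pd

srs = pd.Series(np.arange(4), index=['a', 'b', 'b', 'c'])

dup = srs['b']      # label occurs twice: a Series with two entries
single = srs['c']   # label occurs once: a scalar

print(type(dup).__name__, len(dup))
print(single)
```

Code that must handle both cases can test srs.index.is_unique first.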

In [89]:
df=pd.DataFrame(np.random.randn(12).reshape(4,-1),index=['a','a','b','b'])
df

Unnamed: 0,0,1,2
a,-1.684105,0.379841,-0.646891
a,0.344951,0.071731,-0.242449
b,0.024191,0.338859,-0.255574
b,1.289235,1.503367,-0.24072


In [90]:
df.index

Index(['a', 'a', 'b', 'b'], dtype='object')

In [91]:
df.columns

RangeIndex(start=0, stop=3, step=1)

In [93]:
df.index.is_unique

False

#### Summarizing and Computing Descriptive Statistics

In [2]:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [3]:
df.sum()  #returns column wise sum

one    9.25
two   -5.80
dtype: float64

In [4]:
df.sum().sum()  #sum of entire dataframe

3.45

In [5]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [6]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


NA values are skipped by default. A row or column that is entirely NA therefore sums to 0 (row c above), whereas mean returns NaN for it.
Skipping can be disabled with the skipna option:

In [8]:
df.sum(axis=1,skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
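A middle ground between the two behaviours is the min_count argument of sum: NA values are still skipped, but a row with fewer than min_count valid values yields NaN. A sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

default_sums = df.sum(axis=1)               # all-NA row 'c' sums to 0
strict_sums = df.sum(axis=1, min_count=1)   # row 'c' becomes NaN instead

print(default_sums)
print(strict_sums)
```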

In [16]:
df.idxmin(axis=1)

a    one
b    two
c    NaN
d    two
dtype: object

In [27]:
df.idxmax()

one    b
two    d
dtype: object

In [28]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [32]:
np.argmin(df['one'])

3

In [33]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [37]:
df.sort_values(by='one')

Unnamed: 0,one,two
d,0.75,-1.3
a,1.4,
b,7.1,-4.5
c,,


In [38]:
df.mean()

one    3.083333
two   -2.900000
dtype: float64

In [39]:
df.cummin()

Unnamed: 0,one,two
a,1.4,
b,1.4,-4.5
c,,
d,0.75,-4.5


In [41]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [42]:
df['one'].idxmin()

'd'

#### Correlation and covariance

In [43]:
df['one'].corr(df['two'])

-1.0

In [44]:
df['one'].cov(df['two'])

-10.16

In [45]:
df.corr()

Unnamed: 0,one,two
one,1.0,-1.0
two,-1.0,1.0


In [46]:
df.cov()

Unnamed: 0,one,two
one,12.205833,-10.16
two,-10.16,5.12
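The correlation of exactly -1.0 is a consequence of the missing data rather than a strong finding: corr and cov use only rows where both columns are present, and here that leaves two rows (b and d). Two points always lie on a straight line. The sketch below makes the surviving rows visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

complete = df.dropna()               # rows with both values present
corr = df['one'].corr(df['two'])     # computed on those rows only
cov = complete['one'].cov(complete['two'])

print(complete)
print(corr, cov)
```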


### Unique Values, Value Counts, and Membership

In [47]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [48]:
np.unique(obj)

array(['a', 'b', 'c', 'd'], dtype=object)

In [50]:
obj.unique()  #is not sorted

array(['c', 'a', 'd', 'b'], dtype=object)

In [59]:
sorted(obj.unique())

['a', 'b', 'c', 'd']

In [61]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [69]:
obj.values[:3]

array(['c', 'a', 'd'], dtype=object)

In [70]:
obj.isin(['a','c'])

0     True
1     True
2    False
3     True
4     True
5    False
6    False
7     True
8     True
dtype: bool

In [71]:
obj[obj.isin(['a','c'])]

0    c
1    a
3    a
4    a
7    c
8    c
dtype: object

In some cases, you may want to compute a histogram on multiple related columns in
a DataFrame. Here’s an example.

In [73]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
'Qu2': [2, 3, 1, 2, 3],
'Qu3': [1, 5, 2, 4, 4]})
data


Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [74]:
data.apply(pd.value_counts)

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,,2.0,1.0
3,2.0,2.0,
4,2.0,,2.0
5,,,1.0


In [75]:
data.apply(pd.value_counts).fillna(0)

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
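The counts above are stored as floats because NaN forces a float dtype. A possible variation, sketched below, calls the Series method value_counts through a lambda and converts the filled table back to integers:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values per column, fill the gaps with 0, then restore integer counts.
counts = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(counts)
```

Each column of counts sums to the number of rows in data, which is a quick sanity check.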


In [79]:
data.apply(pd.unique)

Qu1       [1, 3, 4]
Qu2       [2, 3, 1]
Qu3    [1, 5, 2, 4]
dtype: object

