#### Pandas data structures

##### Series

A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data:

In [8]:
import pandas as pd

a=pd.Series([1,2,3,4,5])
a

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]:
# an explicit index can be supplied, one label per value

b=pd.Series([100,200,50],index=['India','China','us'])
b

India    100
China    200
us        50
dtype: int64

In [6]:
b['India':'us']

India    100
China    200
us        50
dtype: int64

In [9]:
b[b>100]

China    200
dtype: int64

In [10]:
b.index

Index(['India', 'China', 'us'], dtype='object')

In [11]:
b.values

array([100, 200,  50], dtype=int64)
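The slice b['India':'us'] above returns all three rows because label-based slices include both endpoints. A minimal sketch contrasting that with positional slicing, reusing the values of b (the names by_label, by_position and mid_range are illustrative only):

```python
import pandas as pd

# Same data as the Series b above.
b = pd.Series([100, 200, 50], index=['India', 'China', 'us'])

# Label-based slice: both endpoints are included.
by_label = b['India':'China']

# Positional slice with iloc: the stop position is excluded.
by_position = b.iloc[0:1]

# Boolean masks can be combined with & and |; each condition needs parentheses.
mid_range = b[(b > 60) & (b < 250)]

print(by_label)
print(by_position)
print(mid_range)
```

The label slice keeps two rows while the positional slice keeps one, which is worth remembering when switching between the two styles.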

Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:


In [14]:
pop={'India':1.5,'China':2,'Usa':1.2}
popn=pd.Series(pop)
popn

India    1.5
China    2.0
Usa      1.2
dtype: float64

In [16]:
# 'Bangla' is not a key in pop, so its value is NaN (missing);
# 'Usa' is not in the requested index, so it is left out of the result
popltn=pd.Series(pop,index=['Bangla','China','India'])
popltn

Bangla    NaN
China     2.0
India     1.5
dtype: float64

In [17]:
pd.isnull(popltn)

Bangla     True
China     False
India     False
dtype: bool

In [18]:
pd.notnull(popltn)

Bangla    False
China      True
India      True
dtype: bool

In [19]:
popltn.isnull()

Bangla     True
China     False
India     False
dtype: bool

In [20]:
popltn.notnull()

Bangla    False
China      True
India      True
dtype: bool
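The boolean masks returned by isnull and notnull are usually put to work rather than just displayed. A short sketch, assuming the same pop dictionary as above, that counts the missing entries and filters them out (n_missing, present and same_result are illustrative names):

```python
import pandas as pd

pop = {'India': 1.5, 'China': 2, 'Usa': 1.2}
popltn = pd.Series(pop, index=['Bangla', 'China', 'India'])

# Summing a boolean mask counts the True entries, i.e. the missing values.
n_missing = popltn.isnull().sum()

# The notnull() mask keeps only the labels that have a value.
present = popltn[popltn.notnull()]

# dropna() is a shorthand for the same filtering.
same_result = popltn.dropna()

print(n_missing)
print(present)
```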

A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:

In [23]:
popn

India    1.5
China    2.0
Usa      1.2
dtype: float64

In [22]:
popltn

Bangla    NaN
China     2.0
India     1.5
dtype: float64

In [24]:
popn+popltn

Bangla    NaN
China     4.0
India     3.0
Usa       NaN
dtype: float64
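Labels found in only one of the two Series become NaN in the sum. If that is not wanted, one option is the add method with fill_value, sketched here with the same data (the name total is illustrative):

```python
import pandas as pd

pop = {'India': 1.5, 'China': 2, 'Usa': 1.2}
popn = pd.Series(pop)
popltn = pd.Series(pop, index=['Bangla', 'China', 'India'])

# fill_value=0 substitutes 0 when a label has a value on one side only.
# 'Usa' exists only in popn, so it keeps its value instead of becoming NaN.
# 'Bangla' has no value on either side, so it stays NaN.
total = popn.add(popltn, fill_value=0)

print(total)
```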

Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:

In [25]:
popn

India    1.5
China    2.0
Usa      1.2
dtype: float64

In [26]:
popn.index.name='Country'
popn.name='Population'


In [27]:
popn

Country
India    1.5
China    2.0
Usa      1.2
Name: Population, dtype: float64
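One place where these names show up is when a Series is converted to a DataFrame. A small sketch, assuming the same popn data, using reset_index (the variable table is illustrative):

```python
import pandas as pd

popn = pd.Series({'India': 1.5, 'China': 2, 'Usa': 1.2})
popn.index.name = 'Country'
popn.name = 'Population'

# reset_index() turns the index into a regular column. The index name and
# the Series name become the column labels of the resulting DataFrame.
table = popn.reset_index()

print(table.columns.tolist())
```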

A Series’s index can be altered in-place by assignment:


In [31]:
popn.index=['Bob','Harry','Jaggu']
popn

Bob      1.5
Harry    2.0
Jaggu    1.2
Name: Population, dtype: float64
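Assigning to index replaces every label at once, so the new list has to match the length of the Series. A hedged sketch of what happens otherwise, plus rename as a way to change individual labels without touching the original (renamed and error_message are illustrative names):

```python
import pandas as pd

popn = pd.Series({'India': 1.5, 'China': 2, 'Usa': 1.2})

# rename() with a mapping changes only the listed labels and returns a new Series.
renamed = popn.rename({'Usa': 'USA'})

# Assigning an index of the wrong length raises a ValueError.
error_message = None
try:
    popn.index = ['Bob', 'Harry']
except ValueError as exc:
    error_message = str(exc)

print(renamed)
print(error_message)
```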

##### DataFrame

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:

In [41]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

df=pd.DataFrame(data)
df

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [42]:
df.head(3)

  state  year  pop
0  Ohio  2000  1.5
1  Ohio  2001  1.7
2  Ohio  2002  3.6


If you specify a sequence of columns, the DataFrame’s columns will be arranged in
that order:

In [43]:
df1=pd.DataFrame(data,columns=['year','state','pop'])
df1

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [45]:
df3=pd.DataFrame(data,columns=['state','province'])
df3

    state province
0    Ohio      NaN
1    Ohio      NaN
2    Ohio      NaN
3  Nevada      NaN
4  Nevada      NaN
5  Nevada      NaN
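The columns argument both selects and creates columns: names that are keys of the dict are kept, names that are not are filled with missing values, and dict keys that are not listed are dropped. A minimal sketch with the same data, checking each of those points:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

df3 = pd.DataFrame(data, columns=['state', 'province'])

# 'province' is not a key of data, so every entry in that column is missing.
print(df3['province'].isnull().all())

# 'year' and 'pop' were not requested, so they are absent from the result.
print(list(df3.columns))
```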


In [46]:
df.columns

Index(['state', 'year', 'pop'], dtype='object')

In [48]:
df5=pd.DataFrame(data,index=['one','two','three','four','five','six'])
df5

        state  year  pop
one      Ohio  2000  1.5
two      Ohio  2001  1.7
three    Ohio  2002  3.6
four   Nevada  2001  2.4
five   Nevada  2002  2.9
six    Nevada  2003  3.2


In [51]:
df5.iloc[0]

state    Ohio
year     2000
pop       1.5
Name: one, dtype: object

In [52]:
df5.loc['one']

state    Ohio
year     2000
pop       1.5
Name: one, dtype: object

In [61]:
# select a single value by row label and column label in one call
df5.loc['one','state']

'Ohio'
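loc selects by label and iloc by integer position, and both accept a row selector and a column selector together. A short sketch, assuming the same data and index as df5 (by_label, by_position and subset are illustrative names):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df5 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six'])

# Scalar access by labels and by positions refers to the same cell here.
by_label = df5.loc['one', 'state']
by_position = df5.iloc[0, 0]

# Lists of labels select several rows and columns at once.
subset = df5.loc[['one', 'three'], ['state', 'pop']]

print(by_label, by_position)
print(subset)
```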

Adding new columns: assigning to a column name that does not exist yet creates the column.

In [64]:
import numpy as np
df5['debt']=np.arange(6.)
df5

        state  year  pop  debt
one      Ohio  2000  1.5   0.0
two      Ohio  2001  1.7   1.0
three    Ohio  2002  3.6   2.0
four   Nevada  2001  2.4   3.0
five   Nevada  2002  2.9   4.0
six    Nevada  2003  3.2   5.0


In [66]:
df3

    state province
0    Ohio      NaN
1    Ohio      NaN
2    Ohio      NaN
3  Nevada      NaN
4  Nevada      NaN
5  Nevada      NaN


In [74]:
df3['eastern']=df3['state']=='Ohio'
df3

    state province  eastern
0    Ohio      NaN     True
1    Ohio      NaN     True
2    Ohio      NaN     True
3  Nevada      NaN    False
4  Nevada      NaN    False
5  Nevada      NaN    False


Deleting columns: the del keyword removes a column in place.

In [75]:
del df3['eastern']
df3

    state province
0    Ohio      NaN
1    Ohio      NaN
2    Ohio      NaN
3  Nevada      NaN
4  Nevada      NaN
5  Nevada      NaN
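Column assignment and del both modify the DataFrame in place. When the original should stay untouched, assign and drop return new DataFrames instead. A sketch with the same data (with_flag and without_flag are illustrative names):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df3 = pd.DataFrame(data, columns=['state', 'province'])

# assign() returns a copy with the extra column; df3 itself is not changed.
with_flag = df3.assign(eastern=df3['state'] == 'Ohio')

# drop(columns=...) returns a copy without the listed columns.
without_flag = with_flag.drop(columns=['eastern'])

print(list(df3.columns))
print(list(with_flag.columns))
print(list(without_flag.columns))
```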


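In pandas versions without Copy-on-Write, the column returned from
indexing a DataFrame can be a view on the underlying data rather than a
copy, so in-place modifications to the Series may be reflected in the
DataFrame. With Copy-on-Write enabled (which the pandas documentation
describes as the default from pandas 3.0), such modifications no longer
propagate. In either case, the column can be explicitly copied with the
Series’s copy method.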
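A minimal sketch of the explicit copy, using a small made-up frame (the frame and the value 99.0 are illustrative only). Changes to the copy never reach the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# copy() gives an independent Series.
pop_copy = df['pop'].copy()
pop_copy.iloc[0] = 99.0

# The DataFrame still holds the original value.
print(df.loc[0, 'pop'])
print(pop_copy.iloc[0])
```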

Another common form of data is a nested dict of dicts. If the nested dict is passed to
the DataFrame, pandas will interpret the outer dict keys as the columns and the inner
keys as the row indices:

In [78]:
pop={'Noida':{'2000':24567,'2001':345678},'Mizorama':{'2000':23456,'2001':34567}}

pop1=pd.DataFrame(pop)
pop1

       Noida  Mizorama
2000   24567     23456
2001  345678     34567


In [79]:
pop1.T

           2000    2001
Noida     24567  345678
Mizorama  23456   34567


In [81]:
pop1.index.name='year'
pop1.columns.name="State"
pop1

State   Noida  Mizorama
year
2000    24567     23456
2001   345678     34567
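The inner dicts do not need identical keys. In the sketch below the '2000' entry for Mizorama is deliberately left out (a change made only for illustration) to show that the row index is built from all inner keys and that gaps become NaN:

```python
import pandas as pd

# Same shape of data as above, with one inner key removed on purpose.
pop = {'Noida': {'2000': 24567, '2001': 345678},
       'Mizorama': {'2001': 34567}}

pop1 = pd.DataFrame(pop)

# Outer keys are the columns; the inner keys of all dicts form the rows.
print(list(pop1.columns))
print(sorted(pop1.index))

# The combination that was never supplied is missing.
print(pd.isna(pop1.loc['2000', 'Mizorama']))
```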


In [82]:
a=pd.Series([1,2,3,4])
b=pd.Series([5,6,7,8])
c=pd.Series(['9','10','11'])

pd.DataFrame([a,b,c])

   0   1   2    3
0  1   2   3  4.0
1  5   6   7  8.0
2  9  10  11  NaN


Unlike Python sets, a pandas Index can contain duplicate labels:

In [85]:

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
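Duplicate labels change what a selection returns. A brief sketch, assuming a small Series built on the same dup_labels index (the values 1 to 4 are made up for illustration):

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

# is_unique reports whether any label is repeated.
print(dup_labels.is_unique)

s = pd.Series([1, 2, 3, 4], index=dup_labels)

# Selecting a duplicated label returns every matching entry as a Series,
# not a single scalar.
foo_values = s['foo']
print(foo_values)
```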

### Reindexing

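reindex is not an in-place operation: it returns a new object with the data conformed to the new index and leaves the original unchanged.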

In [87]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [91]:
obj2=obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
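Labels that are new to the index get NaN by default. A sketch of the fill_value option as an alternative, which also confirms that the original Series is untouched (obj_filled is an illustrative name):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# fill_value replaces the NaN that new labels would otherwise receive.
obj_filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

print(obj_filled)

# reindex returned a new object; obj keeps its original labels and order.
print(list(obj.index))
```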

For ordered data like time series, it may be desirable to do some interpolation or
filling of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:

In [92]:
a=pd.Series(['orange','blue','yellow'],index=[0,2,4])
a

0    orange
2      blue
4    yellow
dtype: object

In [93]:
a.reindex(np.arange(6),method='ffill')

0    orange
1    orange
2      blue
3      blue
4    yellow
5    yellow
dtype: object

In [94]:
a

0    orange
2      blue
4    yellow
dtype: object

In [95]:
a.reindex(np.arange(6))

0    orange
1       NaN
2      blue
3       NaN
4    yellow
5       NaN
dtype: object
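Forward fill is not the only choice for the method option. A sketch of the backward-fill counterpart, bfill, on the same Series (the name back is illustrative). Each gap takes the next available value, so the last position has nothing to borrow from:

```python
import numpy as np
import pandas as pd

a = pd.Series(['orange', 'blue', 'yellow'], index=[0, 2, 4])

# bfill fills each new label from the next existing label.
back = a.reindex(np.arange(6), method='bfill')

print(back)
```

Position 5 stays missing because no later label exists to fill it from.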

With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed only a sequence, it reindexes the rows in the result:

In [108]:
df=pd.DataFrame(np.arange(9,dtype='int64').reshape(3,3),index=['a','c','b'],columns=['kl','ka','tl'])
df

   kl  ka  tl
a   0   1   2
c   3   4   5
b   6   7   8


In [116]:
# reindex returns a new object, so the result is assigned back to df
df=df.reindex(['a','b','c','e'])
df

    kl   ka   tl
a  0.0  1.0  2.0
b  6.0  7.0  8.0
c  3.0  4.0  5.0
e  NaN  NaN  NaN


In [119]:
df.reindex(columns=['ka','kl','up'])

    ka   kl   up
a  1.0  0.0  NaN
b  7.0  6.0  NaN
c  4.0  3.0  NaN
e  NaN  NaN  NaN
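Rows and columns can also be reindexed in a single call by passing both keywords. A sketch that starts again from the original 3x3 frame (the name both is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9, dtype='int64').reshape(3, 3),
                  index=['a', 'c', 'b'], columns=['kl', 'ka', 'tl'])

# index= controls the rows and columns= controls the columns.
# Labels that did not exist before ('e' and 'up') are filled with NaN.
both = df.reindex(index=['a', 'b', 'c', 'e'], columns=['ka', 'kl', 'up'])

print(both)
```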
