In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.__version__

'0.23.0'

## Series

### Series from python list

In [4]:
data=pd.Series([1.2,2.5,3.7,4.6])
data

0    1.2
1    2.5
2    3.7
3    4.6
dtype: float64

- The values of a Series are stored in a NumPy array, available through the `values` attribute

In [5]:
data.values

array([1.2, 2.5, 3.7, 4.6])

In [8]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [10]:
data[1:3]

1    2.5
2    3.7
dtype: float64

- Unlike the implicit integer positions of a NumPy array, the index of a Series can hold any hashable labels, such as strings

In [11]:
data=pd.Series([1.2,2.5,3.7,4.6],index=['a','b','c','d'])
data

a    1.2
b    2.5
c    3.7
d    4.6
dtype: float64

In [12]:
data['a']

1.2
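
With a label index, slices can also be written with labels. The sketch below reuses the same values as above; the names `by_label` and `single` are only illustrative. Note that a label slice includes its stop label, unlike a positional slice.

```python
import pandas as pd

data = pd.Series([1.2, 2.5, 3.7, 4.6], index=['a', 'b', 'c', 'd'])

# Slicing by label: both endpoints 'a' and 'c' are included.
by_label = data['a':'c']

# A single label returns the scalar stored under it.
single = data['b']

print(by_label)
print(single)
```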

If the data is a scalar, it is repeated to fill every entry of the given index.

In [2]:
data=pd.Series(5,index=[100,200,300])
data

100    5
200    5
300    5
dtype: int64

### Series from python dict

In [17]:
population_dict = {'California': 38332521,
                           'Texas': 26448193,
                           'New York': 19651127,
                           'Florida': 19552860,
                           'Illinois': 12882135}
population=pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [15]:
population['Illinois']

12882135

## Dataframe

In [18]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area=pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

### Dataframe from dict of series

In [20]:
states=pd.DataFrame({'population':population,'area':area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [21]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [22]:
states.columns

Index(['population', 'area'], dtype='object')

In [23]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

### Dataframe from single series object

In [4]:
population_dict = {'California': 38332521,
                           'Texas': 26448193,
                           'New York': 19651127,
                           'Florida': 19552860,
                           'Illinois': 12882135}
population=pd.Series(population_dict)
data=pd.DataFrame(population,columns=['population'])
data

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


### Dataframe from list of dicts

In [5]:
dicts=[{'a':i,'b':2*i} for i in range(5)]
data=pd.DataFrame(dicts)
data

   a  b
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8


### Dataframe from numpy array

In [6]:
data=pd.DataFrame(np.random.rand(4,3),index=['a','b','c','d'],columns=['x','y','z'])
data

          x         y         z
a  0.976073  0.928214  0.627709
b  0.427427  0.279999  0.897681
c  0.204901  0.230496  0.495550
d  0.416621  0.858421  0.436123


If no index or column names are specified, default integer labels (0, 1, 2, ...) are used for both.

In [7]:
data=pd.DataFrame(np.random.rand(4,3))
data

          0         1         2
0  0.874864  0.960493  0.200197
1  0.102406  0.603007  0.398259
2  0.206944  0.500375  0.619709
3  0.700153  0.639306  0.951181


## Index

In [9]:
index=pd.Index([2,4,6,7,9])
index

Int64Index([2, 4, 6, 7, 9], dtype='int64')

A pandas Index is immutable: its elements can be read but not reassigned.

In [10]:
index[3]

7

In [11]:
index[3]=4

TypeError: Index does not support mutable operations
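
As a minimal sketch of working with this constraint, the failed assignment can be caught and a new Index built instead. The variable names `assignment_failed` and `new_index`, and the replacement values, are illustrative assumptions.

```python
import pandas as pd

index = pd.Index([2, 4, 6, 7, 9])

try:
    index[3] = 4  # item assignment is not supported on an Index
    assignment_failed = False
except TypeError:
    assignment_failed = True

# The original Index is unchanged; to get different labels, build a new Index.
new_index = pd.Index([2, 4, 6, 4, 9])

print(assignment_failed)
print(list(index))
```

Immutability makes it safe to share one Index between several Series and DataFrames without side effects.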

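An Index can also be treated like a set, and set operations such as union can be performed on it. In the pandas version used here (0.23.0) the `|` operator computes the union; later pandas releases deprecated the operator form in favor of the `union` method.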

In [12]:
index2=pd.Index([1,3,5,6,8])
index2

Int64Index([1, 3, 5, 6, 8], dtype='int64')

In [15]:
index3=index | index2
index3

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
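
The same set operations are also available as Index methods. The sketch below uses the two indexes defined above; the result names (`both`, `common`, and so on) are illustrative.

```python
import pandas as pd

index = pd.Index([2, 4, 6, 7, 9])
index2 = pd.Index([1, 3, 5, 6, 8])

both = index.union(index2)                        # labels in either index
common = index.intersection(index2)               # labels in both indexes
only_first = index.difference(index2)             # labels in index but not in index2
exclusive = index.symmetric_difference(index2)    # labels in exactly one index

print(list(both), list(common), list(only_first), list(exclusive))
```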

## Indexing

A Series or DataFrame can be indexed in two ways:
1. Explicit indexing, using the index labels, like 'a' or 'California'
2. Implicit indexing, using integer positions 0..n-1

When the index itself consists of integers this can be confusing: plain indexing such as `data[1]` uses the **explicit** index, while plain slicing such as `data[1:3]` uses the **implicit** positional index.

Pandas provides indexer attributes that remove this ambiguity:
1. **loc** allows indexing and slicing that always references the **explicit** index. A slice with loc includes its end label.
2. **iloc** allows indexing and slicing that always references the **implicit** Python-style index. A slice with iloc excludes its end position.

In [17]:
data=pd.Series(['a','b','c'],index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

Plain indexing with an integer uses the explicit index:

In [18]:
data[1]

'a'

Plain slicing with integers uses the implicit index:

In [19]:
data[1:3]

3    b
5    c
dtype: object

In [20]:
data.loc[1]

'a'

In [21]:
data.loc[1:3]

1    a
3    b
dtype: object

In [22]:
data.iloc[1]

'b'

In [23]:
data.iloc[1:3]

3    b
5    c
dtype: object

### Indexing in dataframe

In [25]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                          'New York': 141297, 'Florida': 170312,
                          'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                         'New York': 19651127, 'Florida': 19552860,
                         'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [26]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

* Attribute-style access, shown next, is best avoided: it does not work when a column name is not a string usable as a Python identifier, or when the name matches an existing DataFrame method or attribute.

In [27]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
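
The pitfall can be seen with the `pop` column of this very DataFrame, since `DataFrame.pop` is also a method. This is a minimal sketch on a two-row subset of the data; the names `area_matches`, `pop_is_method` and `pop_column` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# 'area' does not clash with anything, so attribute access returns the column.
area_matches = data.area.equals(data['area'])

# 'pop' clashes with the DataFrame.pop method, so data.pop is the method,
# not the column.
pop_is_method = callable(data.pop)

# Dictionary-style access always returns the column.
pop_column = data['pop']

print(area_matches, pop_is_method)
print(pop_column)
```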

The underlying NumPy array can be extracted with the `values` attribute.

In [28]:
data.values

array([[  423967, 38332521],
       [  695662, 26448193],
       [  141297, 19651127],
       [  170312, 19552860],
       [  149995, 12882135]])
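
Because a NumPy array has a single dtype, `values` has to pick one type for all columns. The sketch below, using a hypothetical DataFrame with one integer and one float column, shows the integers being upcast to float.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: 'count' holds integers, 'ratio' holds floats.
mixed = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5]})

all_values = mixed.values             # one array for both columns
count_values = mixed[['count']].values  # array for the integer column only

print(all_values.dtype)    # a floating-point dtype
print(count_values.dtype)  # an integer dtype
```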

In [29]:
data.T

      California     Texas  New York   Florida  Illinois
area      423967    695662    141297    170312    149995
pop     38332521  26448193  19651127  19552860  12882135


In [30]:
data.iloc[:2,:3]

              area       pop
California  423967  38332521
Texas       695662  26448193


In [32]:
data.loc[:'Florida',:'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860


### Index alignment in dataframe

In [34]:
rng=np.random.RandomState(35)
A=pd.DataFrame(rng.randint(0,10,(4,4)),columns=list('ABCD'))
A

   A  B  C  D
0  9  7  1  0
1  9  8  8  8
2  9  7  7  8
3  0  9  2  5


In [37]:
B=pd.DataFrame(rng.randint(0,20,(5,4)),columns=list('BCDA'))
B

    B   C   D   A
0   3  16  12   5
1  12   5   4   2
2   0  11   9   5
3   4  18  18  11
4   4   3  15   1


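When two DataFrames are combined in an operation, pandas aligns them on both row index and column labels, regardless of their order. Positions present in only one of the two (row 4 below, which exists only in B) are filled with NaN, which also makes the result float.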

In [38]:
A + B

      A     B     C     D
0  14.0  10.0  17.0  12.0
1  11.0  20.0  13.0  12.0
2  14.0   7.0  18.0  17.0
3  11.0  13.0  20.0  23.0
4   NaN   NaN   NaN   NaN
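
If NaN is not wanted for positions missing from one operand, the `add` method with its `fill_value` argument is one option. This is a small sketch on hypothetical frames (`left` and `right` are illustrative names, not the random data above), chosen so the results are easy to check by hand.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Columns are in a different order and there is one extra row (index 2).
right = pd.DataFrame({'B': [10, 20, 30], 'A': [100, 200, 300]})

# Operator form: row 2 exists only in `right`, so it becomes NaN.
plain = left + right

# Method form: a value missing from one operand is treated as 0 before adding.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

Positions missing from both operands would still come out as NaN, since there is nothing to fill from either side.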
