# Pandas
- Pandas (PANel Data AnalysiS) is a free, open-source data science library for fast, simple data munging and exploratory analysis in Python.
- Pandas is built on top of NumPy.
- It provides high-level data structures such as the DataFrame (a two-dimensional table) and the Series (a one-dimensional structure with an index), both with rich methods covering the full range of data munging tasks. Pandas also has specialized methods for manipulating and visualizing numerical variables and time series data.

## Features
- Easy handling of missing data. ( dropna, fillna, ffill, isnull, notnull ) 
- Simple mutations of tables to add/remove columns on the fly ( assign, drop ) 
- Easy slicing of data (fancy indexing and subsetting) 
- Automatic data alignment (by index ) 
- Powerful split-apply-combine ( groupby ) for aggregations and transformations 
- Intuitive merge/join ( concat, join ) 
- Flexible Reshaping and Pivoting ( stack, pivot ) 
- Hierarchical labeling of axis indices for working with higher-dimensional data 
- Robust I/O tools to work with csv, Excel, flat files, databases and HDF5 
- Time Series Functionality 
- Easy visualization ( plot ) 
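
A few of the features listed above can be sketched on a toy table. This is a minimal illustration, not part of the original notebook: the `sales` DataFrame and its column names are hypothetical and invented only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented only for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, np.nan, 30, 40],
})

# Missing data: replace NaN with 0.
sales["units"] = sales["units"].fillna(0)

# Add a derived column on the fly, then drop it again.
with_flag = sales.assign(big_order=sales["units"] > 20)
without_flag = with_flag.drop(columns="big_order")

# Split-apply-combine: total units per region.
totals = sales.groupby("region")["units"].sum()
print(totals)
```

Both `assign` and `drop` return new objects, so `sales` itself is left unchanged by those two calls.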

## Pandas Data Structures - Series
- Series <br>
A Series is a one-dimensional array of data (similar to an array, a list, or a column in a table) with an associated labeled index. <br>
It is created much like a NumPy array, by passing the data to the constructor. <br>
A Series can be created from arrays, lists, dicts, and tuples. <br>
- Syntax: pd.Series(data=, index=, dtype=, name=)

In [6]:
import numpy as np
import pandas as pd

In [3]:
# Creating a series using an Array
x_random = np.random.randn(5)
print (x_random)

# Converting array into series
my_series = pd.Series(x_random)
print (my_series)

[ 1.42695453  0.02782135 -0.30301389  1.3982374  -0.86829547]
0    1.426955
1    0.027821
2   -0.303014
3    1.398237
4   -0.868295
dtype: float64


In [5]:
print (dir(my_series))

['T', '_AXIS_ALIASES', '_AXIS_IALIASES', '_AXIS_LEN', '_AXIS_NAMES', '_AXIS_NUMBERS', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_SLICEMAP', '__abs__', '__add__', '__and__', '__array__', '__array_prepare__', '__array_priority__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__div__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__le__', '__len__', '__long__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__ro

In [9]:
pd.Series(x_random.round(2))

0    1.43
1    0.03
2   -0.30
3    1.40
4   -0.87
dtype: float64

In [11]:
myseries_w_index = pd.Series(x_random.round(2), index = list('12345'))

In [12]:
print (myseries_w_index)

1    1.43
2    0.03
3   -0.30
4    1.40
5   -0.87
dtype: float64


In [13]:
myseries_w_index.index

Index(['1', '2', '3', '4', '5'], dtype='object')

In [14]:
myseries_w_index.values

array([ 1.43,  0.03, -0.3 ,  1.4 , -0.87])

In [None]:
# Creating a Series from a list

In [16]:
pd.Series([1,2,3,4], index = list('0123'), dtype=float, name= 'series_1')

0    1.0
1    2.0
2    3.0
3    4.0
Name: series_1, dtype: float64

In [21]:
# Creating a Series from a tuple
pd.Series((1, 2, 3, 4, 5), index=list('abcde'), name='series_2', dtype=float)

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: series_2, dtype: float64

In [22]:
# Creating a Series from a dictionary
pd.Series({'a': 1, 'b': 2, 'c':3}, dtype=float, name='series_3')

a    1.0
b    2.0
c    3.0
Name: series_3, dtype: float64

In [27]:
data = pd.Series({'a': 100, 'b': '200', 'c': np.nan, 
                    'd': None, 'e': 450})
print (data)

a     100
b     200
c     NaN
d    None
e     450
dtype: object


In [28]:
data.tolist()

[100, '200', nan, None, 450]
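
Because the values above mix integers, a string, and missing markers, the Series falls back to the generic `object` dtype. As a hedged sketch, one way to recover a numeric Series is `pd.to_numeric` with `errors='coerce'`; the variable name `numeric` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same mixed-type data as in the cell above.
data = pd.Series({'a': 100, 'b': '200', 'c': np.nan, 'd': None, 'e': 450})
print(data.dtype)      # object, because the values have mixed types

# Parse what can be parsed as a number; anything else becomes NaN.
numeric = pd.to_numeric(data, errors='coerce')
print(numeric)
print(numeric.dtype)
```

The string `'200'` is parsed as a number, while `None` and `NaN` both end up as `NaN` in a float Series.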

In [31]:
# Attributes
print (my_series.values)
print (my_series.index)

[ 1.42695453  0.02782135 -0.30301389  1.3982374  -0.86829547]
RangeIndex(start=0, stop=5, step=1)


In [34]:
print (type(my_series.index))

<class 'pandas.core.indexes.range.RangeIndex'>


In [36]:
print (type(my_series.values))
print (my_series.nbytes)

<class 'numpy.ndarray'>
40


> **From Series to list, to dict**

In [37]:
my_series.tolist()

[1.4269545261254104,
 0.027821353347011118,
 -0.30301389007377455,
 1.3982374048496882,
 -0.8682954738484912]

In [39]:
my_series.to_dict()

{0: 1.4269545261254104,
 1: 0.027821353347011118,
 2: -0.30301389007377455,
 3: 1.3982374048496882,
 4: -0.86829547384849115}

In [40]:
myseries_w_index.tolist()

[1.43, 0.03, -0.3, 1.4, -0.87]

In [41]:
myseries_w_index.to_dict()

{'1': 1.4299999999999999,
 '2': 0.029999999999999999,
 '3': -0.29999999999999999,
 '4': 1.3999999999999999,
 '5': -0.87}

> **Series Indexing**

In [45]:
my_series_1 = pd.Series(np.random.randn(5), index=list('abcde')).round(2)

In [46]:
my_series_1

a   -0.13
b   -1.71
c    1.41
d    1.08
e    0.85
dtype: float64

In [48]:
print (my_series_1['a'])

-0.13


In [49]:
print (my_series_1['a':'d'])

a   -0.13
b   -1.71
c    1.41
d    1.08
dtype: float64


In [50]:
print (my_series_1[:'b'])

a   -0.13
b   -1.71
dtype: float64


In [51]:
print (my_series_1[0:3])

a   -0.13
b   -1.71
c    1.41
dtype: float64


In [52]:
print (my_series_1[:2])

a   -0.13
b   -1.71
dtype: float64


In [53]:
print (my_series_1[::-1]) #Reversing the series

e    0.85
d    1.08
c    1.41
b   -1.71
a   -0.13
dtype: float64


In [55]:
my_series_1.loc[['a','c']]

a   -0.13
c    1.41
dtype: float64

In [57]:
my_series_1.iloc[0:2]

a   -0.13
b   -1.71
dtype: float64

In [None]:
# Series slicing with the .loc and .iloc indexers (.ix is deprecated)
# LABEL-BASED INDEXER
# .loc[] for label-based subsetting
# .iloc[] for integer position-based subsetting

In [59]:
my_series_1.loc[['a', 'c', 'e']] 

a   -0.13
c    1.41
e    0.85
dtype: float64

In [61]:
# INTEGER BASED INDEXED METHOD
print (my_series_1.iloc[0:2])
print (my_series_1.iloc[:-3])

a   -0.13
b   -1.71
dtype: float64
a   -0.13
b   -1.71
dtype: float64
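
The difference between the two indexers is easiest to see side by side. The following is a small sketch with made-up values (the names `s` and `t` are illustrative only): a `.loc` slice includes its end label, an `.iloc` slice excludes its end position, and the two can disagree when the index labels are themselves integers.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

by_label = s.loc['a':'c']      # label slice: the end label 'c' is included
by_position = s.iloc[0:2]      # position slice: position 2 is excluded
print(by_label)
print(by_position)

# With integer labels, .loc looks up the label and .iloc the position.
t = pd.Series(['x', 'y', 'z'], index=[2, 1, 0])
print(t.loc[0])    # the row whose label is 0
print(t.iloc[0])   # the first row
```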


In [62]:
# LABEL-BASED SLICING WITH .loc (the old mixed label/integer indexer .ix was removed in pandas 1.0)
print (my_series_1.loc['a':'c'], my_series.loc[0:2])

a   -0.13
b   -1.71
c    1.41
dtype: float64 0    1.426955
1    0.027821
2   -0.303014
dtype: float64


In [63]:
my_series_1

a   -0.13
b   -1.71
c    1.41
d    1.08
e    0.85
dtype: float64

In [64]:
my_series_1 > 0

a    False
b    False
c     True
d     True
e     True
dtype: bool

In [65]:
my_series_1[my_series_1 > 0]

c    1.41
d    1.08
e    0.85
dtype: float64

In [66]:
my_series_1[[True, True, False, False, True]]

a   -0.13
b   -1.71
e    0.85
dtype: float64

> **Boolean Indexing**

In [67]:
my_series_2 = pd.Series(np.random.randn(15).round(2), index=list('ABC'*5))
print (my_series_2)

A    0.45
B   -1.82
C    1.02
A   -0.49
B    1.15
C    0.87
A    0.78
B   -0.83
C   -0.50
A    0.01
B   -0.49
C    0.27
A   -0.04
B   -1.11
C   -1.63
dtype: float64


In [68]:
my_series_2[my_series_2 > 0]

A    0.45
C    1.02
B    1.15
C    0.87
A    0.78
A    0.01
C    0.27
dtype: float64

In [70]:
print (my_series_2.max())
print (my_series_2.idxmax())

1.15
B


In [72]:
print (my_series_2.loc[my_series_2.idxmax()]) # The label 'B' is duplicated, so this returns every row labeled 'B', not only the maximum

B   -1.82
B    1.15
B   -0.83
B   -0.49
B   -1.11
dtype: float64
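
When index labels repeat, looking up the label returned by `idxmax()` returns every row that shares it. A hedged sketch of one way to isolate just the maximum is to go through its integer position instead; the Series `s` below is a shortened, made-up stand-in for the one above.

```python
import pandas as pd

s = pd.Series([0.45, -1.82, 1.02, 1.15], index=list('ABCB'))

label = s.idxmax()             # label of the maximum: 'B'
all_b = s.loc[label]           # both rows labeled 'B'

pos = s.to_numpy().argmax()    # integer position of the maximum
top = s.iloc[[pos]]            # one-row Series holding only the maximum
print(all_b)
print(top)
```

`s.nlargest(1)` would give the same one-row result without computing the position by hand.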



In [73]:
income = pd.Series([100,200,300,400,1000,250,350,450,100])

In [79]:
print (income.shape, '\n', income.head(2),'\n', income.tail(4))

(9,) 
 0    100
1    200
dtype: int64 
 5    250
6    350
7    450
8    100
dtype: int64


## Data processing

In [80]:
np.sqrt(my_series_2)


A    0.670820
B         NaN
C    1.009950
A         NaN
B    1.072381
C    0.932738
A    0.883176
B         NaN
C         NaN
A    0.100000
B         NaN
C    0.519615
A         NaN
B         NaN
C         NaN
dtype: float64

In [81]:
my_series

0    1.426955
1    0.027821
2   -0.303014
3    1.398237
4   -0.868295
dtype: float64

In [82]:
print (my_series/2)

0    0.713477
1    0.013911
2   -0.151507
3    0.699119
4   -0.434148
dtype: float64


In [83]:
my_series > my_series * 2

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [84]:
# Checking whether values belong to a list
# 1. Using the in operator (on a Series, in tests the index labels; use .values or .isin() to test the values)
my_series = pd.Series(['sit', 'tat', 'bat', 'rat'])
my_series

0    sit
1    tat
2    bat
3    rat
dtype: object

In [86]:
my_series.isin(['tat'])

0    False
1     True
2    False
3    False
dtype: bool

In [88]:
print (1 in my_series.index)

print ('rat' in my_series.index)

True
False


In [89]:
print ('rat' in my_series.values)

True


In [90]:
#THE .isin() method
pls = pd.Series(['c', 'py', 'java', 'scala', 'swift'])
print (pls)
print (pls.isin(['a', 'b']))
print (pls.isin(['r', 'py', 'vba', 'swift']))
print (pd.Series([X in ['r', 'py', 'vba', 'swift'] for X in pls.values]))
print (pls[pls.isin(['java', 'py'])])

0        c
1       py
2     java
3    scala
4    swift
dtype: object
0    False
1    False
2    False
3    False
4    False
dtype: bool
0    False
1     True
2    False
3    False
4     True
dtype: bool
0    False
1     True
2    False
3    False
4     True
dtype: bool
1      py
2    java
dtype: object


In [94]:
s_1 = pd.Series({'sat': 1, 'rat': 3, 'bat': 2, 'tat': 1, 'vat': 0.5})

In [95]:
print (s_1)
print (pd.Series({k: v + 2 for k, v in s_1.items() if k not in ['tat', 'rat']}))
print ([x + 2 for x in s_1])
print (s_1 +2)

bat    2.0
rat    3.0
sat    1.0
tat    1.0
vat    0.5
dtype: float64
bat    4.0
sat    3.0
vat    2.5
dtype: float64
[4.0, 5.0, 3.0, 3.0, 2.5]
bat    4.0
rat    5.0
sat    3.0
tat    3.0
vat    2.5
dtype: float64


In [97]:
#Re-indexing
my_series = pd.Series(np.random.randn(5), index = list('abcde')).round(2)
print (my_series)
print (my_series.reindex(['a', 'y', 'c', 'd', 'e', 'x']))

a   -0.23
b    1.69
c   -1.03
d    0.04
e    1.61
dtype: float64
a   -0.23
y     NaN
c   -1.03
d    0.04
e    1.61
x     NaN
dtype: float64
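
Labels that do not exist in the original index come back as `NaN` after re-indexing. As a small sketch with made-up values, the `fill_value` parameter of `reindex` can supply a default instead.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=list('abc'))

# 'x' is not in the original index, so it becomes NaN.
with_nan = s.reindex(['a', 'x', 'c'])
print(with_nan)

# fill_value replaces the NaN for labels that were missing.
with_default = s.reindex(['a', 'x', 'c'], fill_value=0.0)
print(with_default)
```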


In [8]:
cars = pd.Series({'Toyota': 100, 'Suzuki': None, 'Hyundai': 600, 
                    'Ford': 700, 'Lexus': 450, 'Kia': None})
print (cars)

# Adding two Series aligns them on their index labels: the result's index is the union of both indexes,
# and the addition happens only on shared labels. Labels present in just one Series produce NaN (not a number).
print ('\n')
print (cars[['Toyota', 'Ford', 'Lexus']])



Ford       700.0
Hyundai    600.0
Kia          NaN
Lexus      450.0
Suzuki       NaN
Toyota     100.0
dtype: float64


Toyota    100.0
Ford      700.0
Lexus     450.0
dtype: float64


In [9]:
print (cars[['Hyundai', 'Lexus']])
print ('\n')
print (cars[['Hyundai', 'Lexus']] + cars[['Toyota', 'Ford', 'Lexus']])

Hyundai    600.0
Lexus      450.0
dtype: float64


Ford         NaN
Hyundai      NaN
Lexus      900.0
Toyota       NaN
dtype: float64
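
If the `NaN` results for unshared labels are not wanted, the `add` method accepts a `fill_value` that stands in for the missing side. This is a minimal sketch that rebuilds the two operands from the values shown above; the names `left` and `right` are illustrative only.

```python
import pandas as pd

left = pd.Series({'Hyundai': 600.0, 'Lexus': 450.0})
right = pd.Series({'Toyota': 100.0, 'Ford': 700.0, 'Lexus': 450.0})

# The plain operator yields NaN wherever a label is missing on one side.
plain = left + right
print(plain)

# add() with fill_value treats a label missing on one side as 0.
filled = left.add(right, fill_value=0)
print(filled)
```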


In [17]:
# TYPE CONVERSION
# astype explicitly converts a Series from one dtype to another
my_series=pd.Series(np.random.randn(1000).round(2))
print (my_series.head())


0   -0.65
1    1.08
2   -0.47
3   -0.29
4   -0.91
dtype: float64


In [19]:
print (my_series.astype(str).head())

0    -0.65
1     1.08
2    -0.47
3    -0.29
4    -0.91
dtype: object


In [20]:
my_series.clip(upper=1)

0     -0.65
1      1.00
2     -0.47
3     -0.29
4     -0.91
5      1.00
6      0.50
7      1.00
8     -2.15
9     -0.76
10     1.00
11    -2.34
12     1.00
13     0.20
14    -0.84
15    -0.02
16    -0.82
17    -0.22
18     0.14
19    -0.30
20    -0.41
21     0.59
22     0.57
23    -0.25
24    -0.87
25     0.34
26     0.14
27    -0.69
28     0.66
29     0.95
       ... 
970    1.00
971   -0.21
972   -0.05
973   -0.85
974   -0.92
975   -0.40
976   -1.73
977   -0.72
978    0.66
979    0.27
980    1.00
981    0.35
982    0.21
983    0.87
984    0.15
985    0.45
986   -1.12
987   -1.43
988    1.00
989    0.34
990   -0.21
991    1.00
992   -0.77
993   -0.29
994   -1.22
995   -0.58
996    0.35
997   -0.42
998   -1.17
999    1.00
Length: 1000, dtype: float64

In [21]:
my_series.clip(lower=-1)

0     -0.65
1      1.08
2     -0.47
3     -0.29
4     -0.91
5      1.12
6      0.50
7      1.45
8     -1.00
9     -0.76
10     1.21
11    -1.00
12     1.06
13     0.20
14    -0.84
15    -0.02
16    -0.82
17    -0.22
18     0.14
19    -0.30
20    -0.41
21     0.59
22     0.57
23    -0.25
24    -0.87
25     0.34
26     0.14
27    -0.69
28     0.66
29     0.95
       ... 
970    1.76
971   -0.21
972   -0.05
973   -0.85
974   -0.92
975   -0.40
976   -1.00
977   -0.72
978    0.66
979    0.27
980    1.03
981    0.35
982    0.21
983    0.87
984    0.15
985    0.45
986   -1.00
987   -1.00
988    1.63
989    0.34
990   -0.21
991    1.40
992   -0.77
993   -0.29
994   -1.00
995   -0.58
996    0.35
997   -0.42
998   -1.00
999    1.84
Length: 1000, dtype: float64

In [22]:
# 1. Handling outliers
print (my_series.head(10))
print (my_series.head(10).clip(upper=.50))
print (my_series.head(10).clip(lower=.50))

0   -0.65
1    1.08
2   -0.47
3   -0.29
4   -0.91
5    1.12
6    0.50
7    1.45
8   -2.15
9   -0.76
dtype: float64
0   -0.65
1    0.50
2   -0.47
3   -0.29
4   -0.91
5    0.50
6    0.50
7    0.50
8   -2.15
9   -0.76
dtype: float64
0    0.50
1    1.08
2    0.50
3    0.50
4    0.50
5    1.12
6    0.50
7    1.45
8    0.50
9    0.50
dtype: float64


In [36]:
print (my_series.head(10))
print (my_series.head(10).clip(lower=my_series.quantile(0.1)))


0   -0.65
1    1.08
2   -0.47
3   -0.29
4   -0.91
5    1.12
6    0.50
7    1.45
8   -2.15
9   -0.76
dtype: float64
0   -0.650
1    1.080
2   -0.470
3   -0.290
4   -0.910
5    1.120
6    0.500
7    1.450
8   -1.302
9   -0.760
dtype: float64


In [35]:
my_series.quantile(0.1)

-1.302
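
The same idea extends to capping both tails at once. The sketch below clips a random sample to its own 10th and 90th percentiles; the seed and the variable names are assumptions chosen for illustration, so the numbers will differ from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the seed only makes the sketch repeatable.
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(1000))

low = values.quantile(0.10)
high = values.quantile(0.90)

# Values below low are raised to low; values above high are lowered to high.
clipped = values.clip(lower=low, upper=high)
print(clipped.describe())
```

Clipping keeps every observation, unlike filtering, so the length of the Series does not change.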

In [37]:
fruits = pd.Series(['apples', 'oranges', 'peaches', 'mangoes']) 
print (fruits)

0     apples
1    oranges
2    peaches
3    mangoes
dtype: object


In [39]:
fruits.replace({'apples':'grapes', 'peaches':'bananas'})

0     grapes
1    oranges
2    bananas
3    mangoes
dtype: object

In [None]:
# Handle missing values

In [40]:
my_series=pd.Series([1.12, 3.14, np.nan, 6.02, 2.73, None])
print (my_series)

0    1.12
1    3.14
2     NaN
3    6.02
4    2.73
5     NaN
dtype: float64


In [42]:
print (my_series.isnull())

0    False
1    False
2     True
3    False
4    False
5     True
dtype: bool


In [43]:
print (sum(my_series.isnull()))

2


In [48]:
print (my_series.notnull(), '\n', 'sum of not null elements: ', sum(my_series.notnull()))

0     True
1     True
2    False
3     True
4     True
5    False
dtype: bool 
 sum of not null elements:  4


In [49]:
my_series[my_series.notnull()]

0    1.12
1    3.14
3    6.02
4    2.73
dtype: float64

In [50]:
my_series[my_series.isnull()]

2   NaN
5   NaN
dtype: float64

In [51]:
my_series.fillna(0)

0    1.12
1    3.14
2    0.00
3    6.02
4    2.73
5    0.00
dtype: float64

In [53]:
my_series_2 = my_series[my_series.notnull()]

In [54]:
my_series_2

0    1.12
1    3.14
3    6.02
4    2.73
dtype: float64

In [55]:
my_series.fillna(my_series_2.mean())

0    1.1200
1    3.1400
2    3.2525
3    6.0200
4    2.7300
5    3.2525
dtype: float64

In [58]:
my_series.ffill()

0    1.12
1    3.14
2    3.14
3    6.02
4    2.73
5    2.73
dtype: float64

In [59]:
my_series.bfill()

0    1.12
1    3.14
2    6.02
3    6.02
4    2.73
5     NaN
dtype: float64
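
A backward fill leaves a trailing `NaN` in place because there is no later value to copy from. Two common ways around this, sketched here on the same values, are dropping the missing entries or chaining a forward fill after the backward fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.12, 3.14, np.nan, 6.02, 2.73, None])

# Option 1: remove the missing entries (the original index labels are kept).
dropped = s.dropna()
print(dropped)

# Option 2: backward fill first, then forward fill the trailing gap.
filled = s.bfill().ffill()
print(filled)
```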

In [60]:
# Difference between NaN and None
print (type(np.nan)) # A floating-point "not a number" value
print (type(None)) # Python's null object; pandas treats both as missing (NA)

<class 'float'>
<class 'NoneType'>
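
The two markers also behave differently in comparisons, which is why pandas offers dedicated missing-value checks. A short sketch (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# NaN is a float that never compares equal, not even to itself.
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)

# pd.isna recognizes both markers as missing.
both_missing = pd.isna(np.nan) and pd.isna(None)
print(both_missing)

# In a float Series, None is stored as NaN.
as_float = pd.Series([1.0, None])
print(as_float)
```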


In [61]:
my_series = pd.Series(list('abcd' * 3)) 
print (my_series)

0     a
1     b
2     c
3     d
4     a
5     b
6     c
7     d
8     a
9     b
10    c
11    d
dtype: object


In [62]:
my_series.unique() #Finding unique values

array(['a', 'b', 'c', 'd'], dtype=object)

In [63]:
my_series.nunique()

4

In [64]:
my_series.value_counts()

d    3
a    3
b    3
c    3
dtype: int64
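
`value_counts` can also report relative frequencies rather than raw counts. A minimal sketch on a made-up Series with unequal counts, using the `normalize` parameter:

```python
import pandas as pd

letters = pd.Series(list('aabbbc'))

counts = letters.value_counts()                 # sorted by count, descending
shares = letters.value_counts(normalize=True)   # fractions that sum to 1
print(counts)
print(shares)
```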

In [65]:
my_series.duplicated() 

0     False
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
dtype: bool

In [66]:
my_series.drop_duplicates()

0    a
1    b
2    c
3    d
dtype: object

In [67]:
my_series =  pd.Series(np.random.randint(0, 50, 6), index=list('xyzabc'))
print (my_series)

x    47
y    23
z    28
a    34
b    33
c    43
dtype: int32


In [68]:
print (my_series.idxmax()) #index of largest value in the series

x


In [69]:
print (my_series.idxmin()) #index of the smallest value in the series

y


In [70]:
print (my_series.nlargest(3))

x    47
c    43
a    34
dtype: int32


In [71]:
print (my_series.nsmallest(2))

y    23
z    28
dtype: int32


In [72]:
# Sorting the series by its index labels
my_series.sort_index()

a    34
b    33
c    43
x    47
y    23
z    28
dtype: int32

In [73]:
# Sorting the data in the series
my_series.sort_values()

y    23
z    28
b    33
a    34
c    43
x    47
dtype: int32

In [80]:
# Mathematical stats
my_series = pd.Series(np.random.randn(100))
print (my_series.head())

0   -0.690495
1    0.768085
2    0.852640
3    0.405734
4    1.087223
dtype: float64


In [81]:
my_series.mean()

-0.05843946482004778

In [82]:
my_series.std()

1.0528916087388633

In [83]:
my_series.count()

100

In [84]:
my_series.sum()

-5.843946482004778

In [88]:
my_series.quantile([0.10, 0.25,0.5,0.75,0.9,0.99,1])

0.10   -1.408396
0.25   -0.810340
0.50   -0.133752
0.75    0.677285
0.90    1.308398
0.99    2.337957
1.00    2.504231
dtype: float64

In [86]:
my_series.nlargest()

76    2.504231
63    2.336277
30    2.151600
51    2.139180
10    1.818626
dtype: float64

In [89]:
my_series.describe()

count    100.000000
mean      -0.058439
std        1.052892
min       -2.816526
25%       -0.810340
50%       -0.133752
75%        0.677285
max        2.504231
dtype: float64
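
The summary can be tailored when the default quartiles are not the statistics of interest. The sketch below uses a small made-up Series to show the `percentiles` parameter of `describe` and, as an alternative, `agg` with a list of statistic names.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Ask describe() for the 10th and 90th percentiles.
summary = s.describe(percentiles=[0.1, 0.9])
print(summary)

# Or compute just a chosen set of statistics.
stats = s.agg(['mean', 'std', 'min', 'max'])
print(stats)
```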

In [91]:
my_series=pd.Series(np.random.randn(1000))
print (my_series)
my_series.plot.hist()

0      0.658742
1      2.352048
2      0.418716
3     -0.717198
4      0.184733
5     -0.175573
6     -1.283064
7     -0.904350
8      0.319464
9     -0.981835
10    -0.253615
11     0.748425
12     0.430934
13     1.145727
14     0.534786
15     1.083289
16     1.597611
17    -0.013498
18     0.675615
19    -1.051956
20     0.324663
21    -1.079374
22    -0.390629
23    -0.355962
24    -2.484589
25     0.874858
26    -1.693141
27    -2.064268
28    -0.491035
29     0.631183
         ...   
970   -0.556615
971   -1.424172
972    0.499777
973    0.462707
974    0.652560
975   -1.309925
976    0.361832
977   -0.579253
978   -0.631011
979   -0.731892
980   -0.870242
981    0.412899
982    0.447525
983    0.365725
984   -0.179239
985    0.726575
986   -1.071497
987   -0.332585
988   -0.364528
989   -0.570820
990    0.350402
991    1.062569
992    0.268639
993   -0.954723
994    0.378401
995   -1.520779
996   -1.173494
997   -0.147597
998   -0.253786
999    0.294150
Length: 1000, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x9de838e400>