**Pandas is a data manipulation library for Python. It helps you clean, reshape, and analyze data.**

In [1]:
# Importing convention
import pandas as pd

**Series**

In [2]:
obj = pd.Series([1, 2, 3, 4, 5])

In [3]:
obj

0    1
1    2
2    3
3    4
4    5
dtype: int64

As you can see, `pd.Series()` returns a one-dimensional array-like object, displayed with its index on the left; by default the index runs from 0 to n-1.

In [4]:
obj.values

array([1, 2, 3, 4, 5], dtype=int64)

In [5]:
obj.index
# returns an index for series object

RangeIndex(start=0, stop=5, step=1)

We can also change index values:

In [6]:
obj2 = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])

In [7]:
obj2

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

In [8]:
obj2['e']

5

In [9]:
obj2.index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

We can filter a pandas Series with a boolean array, multiply it by a scalar, or apply math functions to it. The link between each index label and its value is preserved in every case.

In [10]:
obj2[obj2 > 3]

d    4
e    5
f    6
dtype: int64

In [11]:
obj2 * 2

a     2
b     4
c     6
d     8
e    10
f    12
dtype: int64

In [12]:
import numpy as np

In [13]:
np.exp(obj2)

a      2.718282
b      7.389056
c     20.085537
d     54.598150
e    148.413159
f    403.428793
dtype: float64

We can also create a Series from a Python dict; the keys become the index labels.
```Python
data = {'Apple': '25$', 'Peas': '5$', 'Oranges': '25$', 'Berries': '25$'}
```

In [14]:
data = {'Delhi': 15000, 'Jaipur': 10000, 'Mumbai': 4511, 'Bengluru': 458778}

In [15]:
obj3 = pd.Series(data)

In [16]:
obj3

Delhi        15000
Jaipur       10000
Mumbai        4511
Bengluru    458778
dtype: int64

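We can also pass an explicit index together with the dict. Only the dict keys that match the given labels are kept, and labels with no matching key get NaN. Here none of the state names is a key in `data`, so every value is missing.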

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [18]:
obj4 = pd.Series(data, index=states)

In [19]:
obj4

California   NaN
Ohio         NaN
Oregon       NaN
Texas        NaN
dtype: float64
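
The all-NaN result above is the extreme case. A minimal sketch of a partial overlap follows; the `cities` list and the label `'Pune'` are made up for illustration and do not come from the notebook.

```python
import pandas as pd

data = {'Delhi': 15000, 'Jaipur': 10000, 'Mumbai': 4511}
# 'Pune' is a hypothetical label that is not a key in the dict
cities = ['Mumbai', 'Delhi', 'Pune']

partial = pd.Series(data, index=cities)
print(partial)
# Labels found in the dict keep their values (in the requested order),
# 'Pune' becomes NaN, and 'Jaipur' is left out because it was not requested.
print(partial.isnull())
```

Because one value is NaN, the resulting Series is stored as floats rather than integers.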

In [20]:
obj3 + obj4

Bengluru     NaN
California   NaN
Delhi        NaN
Jaipur       NaN
Mumbai       NaN
Ohio         NaN
Oregon       NaN
Texas        NaN
dtype: float64

Pandas uses the terms "missing" and "NA" interchangeably to refer to missing data.
The functions that detect it are:
`pd.isnull(data)`, `pd.notnull(data)`

In [21]:
pd.isnull(obj4)

California    True
Ohio          True
Oregon        True
Texas         True
dtype: bool

In [22]:
pd.notnull(obj3)

Delhi       True
Jaipur      True
Mumbai      True
Bengluru    True
dtype: bool
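
The boolean results above are often more useful when counted or used as a filter. A small sketch, using a made-up Series `s` rather than the notebook's objects:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 3.0, np.nan], index=['a', 'b', 'c', 'd'])

n_missing = s.isnull().sum()   # True counts as 1, so this counts the missing values
present = s[s.notnull()]       # keep only the entries that are not missing

print(n_missing)
print(present)
```

`isnull` and `notnull` are available both as top-level functions and as Series methods; the method form is used here.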

In [23]:
obj3.Delhi

15000

## DataFrame

**The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.**

In [24]:
dataFrame = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [25]:
frame = pd.DataFrame(dataFrame)

In [26]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [27]:
not_nevada = (frame.state != 'Nevada')

In [28]:
frame[not_nevada]

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6


In [29]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [30]:
pd.DataFrame(frame, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [31]:
frame2 = pd.DataFrame(frame, columns=['year', 'state', 'pop', 'debt'], 
                     index=['one', 'two', 'three', 'four', 'five', 'six'])

In [32]:
frame2

Unnamed: 0,year,state,pop,debt
one,,,,
two,,,,
three,,,,
four,,,,
five,,,,
six,,,,


In [33]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [34]:
frame2 = pd.DataFrame(dataFrame, columns=['state', 'year', 'pop', 'debt'], 
                     index=['one', 'two', 'three', 'four','five', 'six'])

In [35]:
frame2

Unnamed: 0,state,year,pop,debt
one,Ohio,2000,1.5,
two,Ohio,2001,1.7,
three,Ohio,2002,3.6,
four,Nevada,2001,2.4,
five,Nevada,2002,2.9,
six,Nevada,2003,3.2,


In [36]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [37]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Rows can be retrieved by label with the special `loc` attribute (and by position with `iloc`).

In [38]:
frame2.loc['three']

state    Ohio
year     2002
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment, for example by assigning a scalar value to every row.

In [39]:
frame2['debt'] = 0.0

In [40]:
frame2

Unnamed: 0,state,year,pop,debt
one,Ohio,2000,1.5,0.0
two,Ohio,2001,1.7,0.0
three,Ohio,2002,3.6,0.0
four,Nevada,2001,2.4,0.0
five,Nevada,2002,2.9,0.0
six,Nevada,2003,3.2,0.0


When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame.
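
A quick sketch of the length rule stated above, on a small made-up frame `df`. In the pandas versions I have used, a mismatched list raises `ValueError`; treat the exact message as version dependent.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Texas']})

try:
    df['debt'] = [1.0, 2.0]   # only 2 values for 3 rows
    error = None
except ValueError as exc:
    error = exc
    print('Assignment failed:', exc)

df['debt'] = [1.0, 2.0, 3.0]  # lengths match, so this works
print(df)
```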

In [41]:
frame2.loc['one']

state    Ohio
year     2000
pop       1.5
debt        0
Name: one, dtype: object

In [42]:
frame2.columns

Index(['state', 'year', 'pop', 'debt'], dtype='object')

In [43]:
frame['year']

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [44]:
frame2['debt'] = np.arange(6.)

In [45]:
frame2

Unnamed: 0,state,year,pop,debt
one,Ohio,2000,1.5,0.0
two,Ohio,2001,1.7,1.0
three,Ohio,2002,3.6,2.0
four,Nevada,2001,2.4,3.0
five,Nevada,2002,2.9,4.0
six,Nevada,2003,3.2,5.0


When you assign a Series instead, its labels are aligned to the DataFrame's index, and any index label that is missing from the Series gets NaN.

In [46]:
val  = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])

In [47]:
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [48]:
frame2['debt'] = val

In [49]:
frame2

Unnamed: 0,state,year,pop,debt
one,Ohio,2000,1.5,
two,Ohio,2001,1.7,-1.2
three,Ohio,2002,3.6,
four,Nevada,2001,2.4,-1.5
five,Nevada,2002,2.9,-1.7
six,Nevada,2003,3.2,


Assigning to a column that doesn't exist creates it; the `del` keyword deletes a column, as with a dict.

In [50]:
frame2['eastern'] = frame2.state == 'Ohio'

In [51]:
frame2

Unnamed: 0,state,year,pop,debt,eastern
one,Ohio,2000,1.5,,True
two,Ohio,2001,1.7,-1.2,True
three,Ohio,2002,3.6,,True
four,Nevada,2001,2.4,-1.5,False
five,Nevada,2002,2.9,-1.7,False
six,Nevada,2003,3.2,,False


In [52]:
frame2.eastern

one       True
two       True
three     True
four     False
five     False
six      False
Name: eastern, dtype: bool

In [53]:
del frame2['eastern']

In [54]:
frame2

Unnamed: 0,state,year,pop,debt
one,Ohio,2000,1.5,
two,Ohio,2001,1.7,-1.2
three,Ohio,2002,3.6,
four,Nevada,2001,2.4,-1.5
five,Nevada,2002,2.9,-1.7
six,Nevada,2003,3.2,


In [55]:
pop = {'Nevada': {2001:2.4, 2002:2.9},
      'Ohio': {2000:1.5, 2001:1.7, 2002:3.6}}

In [56]:
frame3 = pd.DataFrame(pop)

In [57]:
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [58]:
frame3.loc[2000]

Nevada    NaN
Ohio      1.5
Name: 2000, dtype: float64

In [59]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [60]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [61]:
frame2.values

array([['Ohio', 2000, 1.5, nan],
       ['Ohio', 2001, 1.7, -1.2],
       ['Ohio', 2002, 3.6, nan],
       ['Nevada', 2001, 2.4, -1.5],
       ['Nevada', 2002, 2.9, -1.7],
       ['Nevada', 2003, 3.2, nan]], dtype=object)

### Index Objects

In [62]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [63]:
obj

a    0
b    1
c    2
dtype: int64

In [64]:
obj.index

Index(['a', 'b', 'c'], dtype='object')

In [65]:
index = obj.index

In [66]:
index[1:]

Index(['b', 'c'], dtype='object')

A pandas Index can contain duplicate labels.


In [67]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [68]:
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

## Essential Functionality

**Reindexing**

In [69]:
obj = pd.Series([4.5, 7.2, -5.3, 3.5], index=['d', 'a', 'c', 'b'])

In [70]:
obj

d    4.5
a    7.2
c   -5.3
b    3.5
dtype: float64

Calling `reindex` on this Series rearranges the data according to the new index, introducing NaN for any label that was not already present.

In [71]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [72]:
obj2

a    7.2
b    3.5
c   -5.3
d    4.5
e    NaN
dtype: float64
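
If NaN is not wanted for the new labels, `reindex` accepts a `fill_value` argument. A minimal sketch reusing the same values as the cell above:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.5], index=['d', 'a', 'c', 'b'])

# Labels missing from the original ('e') receive fill_value instead of NaN
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(filled)
```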

In [73]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                    index=['a', 'b', 'c'], 
                    columns=['Ohio', 'Texas', 'California'])

In [74]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,3,4,5
c,6,7,8


In [75]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [76]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,3.0,4.0,5.0
c,6.0,7.0,8.0
d,,,


Columns can be reindexed with the `columns` keyword.

In [77]:
states = ['Texa', 'Utah', 'California']

In [78]:
frame.reindex(columns=states)

Unnamed: 0,Texa,Utah,California
a,,,2
b,,,5
c,,,8


In [79]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,3,4,5
c,6,7,8


In [80]:
frame.loc[['a', 'b', 'c', 'd']]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,3.0,4.0,5.0
c,6.0,7.0,8.0
d,,,
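
The warning above comes from an older pandas release. As far as I know, from pandas 1.0 onward the same `.loc` call raises `KeyError` instead of returning a row of NaN, so `reindex` is the safer choice when some labels may be missing. A sketch under that assumption:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'California'])

try:
    frame.loc[['a', 'b', 'c', 'd']]   # 'd' is not in the index
    raised = False
except KeyError:
    raised = True

# reindex is the explicit way to ask for labels that may be missing
safe = frame.reindex(['a', 'b', 'c', 'd'])
print(raised)
print(safe)
```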


#### Dropping Entries from an Axis

In [81]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [82]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [83]:
new_obj = obj.drop('c')

In [84]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [85]:
obj.drop(['a', 'e'])

b    1.0
c    2.0
d    3.0
dtype: float64

In [86]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index = ['a', 'b', 'c', 'd'],
                   columns=['Delhi', 'Jaipur', 'Mumbai', 'UP'])

In [87]:
data

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [88]:
data.drop(['a', 'c'])

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
b,4,5,6,7
d,12,13,14,15


In [89]:
data

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [90]:
data.drop('Delhi', axis=1)

Unnamed: 0,Jaipur,Mumbai,UP
a,1,2,3
b,5,6,7
c,9,10,11
d,13,14,15


In [91]:
data.drop('a', inplace=True)

In [92]:
data

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [93]:
data

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


**If `inplace=True` is set, `drop` modifies the original object instead of returning a new one, so the dropped data is permanently lost.**
##### BE CAREFUL
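
To make the difference concrete, here is a small sketch on a made-up frame `df`. It shows that the default call leaves the original untouched, while the `inplace=True` call changes it and returns `None`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((3, 2)),
                  index=['a', 'b', 'c'], columns=['x', 'y'])

smaller = df.drop('a')               # returns a new object; df still has row 'a'
result = df.drop('b', inplace=True)  # modifies df itself and returns None

print(smaller.index.tolist())
print(df.index.tolist())
print(result)
```

A common mistake is writing `df = df.drop('b', inplace=True)`, which replaces `df` with `None`.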

## Indexing, Selection, and Filtering

In [94]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [95]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [96]:
obj['b']

1.0

In [97]:
obj[1]

1.0

In [98]:
obj[['a', 'b']]

a    0.0
b    1.0
dtype: float64

In [99]:
obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64

In [100]:
obj['a':'c'] = 5

In [101]:
obj

a    5.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [102]:
data

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [103]:
data[:2]

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
b,4,5,6,7
c,8,9,10,11


In [104]:
data[data['UP'] > 8]

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
c,8,9,10,11
d,12,13,14,15


In [105]:
data[data['UP'] < 10] = 0

In [106]:
data

Unnamed: 0,Delhi,Jaipur,Mumbai,UP
b,0,0,0,0
c,8,9,10,11
d,12,13,14,15


**Selection with loc and iloc**

In [107]:
data.loc['b', ['Delhi', 'Jaipur']]

Delhi     0
Jaipur    0
Name: b, dtype: int32

In [108]:
data.loc['c', ['Delhi', 'Jaipur']]

Delhi     8
Jaipur    9
Name: c, dtype: int32

In [109]:
data.iloc[2, [1, 2, 0]]

Jaipur    13
Mumbai    14
Delhi     12
Name: d, dtype: int32

When selecting data with `loc` or `iloc`, think of the pattern as `.loc[row_labels, column_labels]` and `.iloc[row_positions, column_positions]`.
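
One detail worth a sketch is slicing: label slices with `loc` include both endpoints, while integer slices with `iloc` exclude the stop position. The frame below is a fresh example with the same labels as `data`, not the notebook's modified object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=['b', 'c', 'd'],
                  columns=['Delhi', 'Jaipur', 'Mumbai', 'UP'])

by_label = df.loc['b':'c', 'Delhi':'Mumbai']   # label slices include both endpoints
by_position = df.iloc[0:2, 0:3]                # integer slices exclude the stop position

print(by_label)
print(by_label.equals(by_position))
```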

In [110]:
data.iloc[2]

Delhi     12
Jaipur    13
Mumbai    14
UP        15
Name: d, dtype: int32

In [111]:
data.iloc[:, 2]

b     0
c    10
d    14
Name: Mumbai, dtype: int32

###### Arithmetic and Data Alignment

In [112]:
#Series 1
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
print(*s1.index)

a b c d


In [113]:
s2 = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [114]:
s2

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [115]:
s1 + s2

a    7.3
b   -1.5
c    5.4
d    4.5
e    NaN
dtype: float64

In [116]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])

In [117]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bcd'),
                  index = ['Utah', 'Ohio', 'Texas', 'Oregon'])

In [118]:
df1 + df2

Unnamed: 0,b,c,d
Colorado,,,
Ohio,3.0,5.0,7.0
Oregon,,,
Texas,9.0,11.0,13.0
Utah,,,


**Arithmetic methods with fill values**

In [120]:
s1 = pd.Series(np.arange(4.), index=['a', 'b', 'd', 'c'])

In [121]:
s2 = pd.Series(np.arange(5.), index=['a', 'b', 'e', 'd', 'c'])

In [122]:
s1

a    0.0
b    1.0
d    2.0
c    3.0
dtype: float64

In [123]:
s2

a    0.0
b    1.0
e    2.0
d    3.0
c    4.0
dtype: float64

In [124]:
s1 + s2

a    0.0
b    2.0
c    7.0
d    5.0
e    NaN
dtype: float64

In [125]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), 
                  columns=list('bcd'), 
                  index=['Ohio', 'Texas', 'Colorado'])

In [126]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
                  columns=list('bde'), 
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [127]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [128]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [129]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


Because columns 'c' and 'e' are not found in both DataFrame objects, they appear as all missing in the result. The same holds for rows whose labels are not common to both objects.

In [130]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

In [131]:
df1 

Unnamed: 0,A
0,1
1,2


In [132]:
df2

Unnamed: 0,A
0,3
1,4


In [133]:
df1 + df2

Unnamed: 0,A
0,4
1,6


In [134]:
df1 - df2

Unnamed: 0,A
0,-2
1,-2


**Arithmetic methods with fill values**

In [135]:
df1

Unnamed: 0,A
0,1
1,2


In [136]:
df1.index

RangeIndex(start=0, stop=2, step=1)

In [137]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), 
                  columns=list('abcd'))

In [138]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), 
                  columns=list('abcde'))

In [139]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [140]:
df2.loc[1, 'b']

6.0

Suppose we want to change the value at row label 1 in column 'b'.

In [141]:
df2.loc[1, 'b'] = np.nan

In [142]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [143]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [144]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [145]:
df1.add(df2)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


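So `df1 + df2` is shorthand for the `add` method:
```Python
df1.add(df2)   # same result as df1 + df2
```
The method form also accepts a `fill_value` argument: a value that is missing from only one of the two objects is replaced by `fill_value` before the addition.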

In [146]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
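
One subtlety of `fill_value`, sketched on two made-up frames `left` and `right`: the fill applies only where exactly one side is missing. A cell that is missing on both sides stays NaN.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan]})
right = pd.DataFrame({'x': [np.nan, np.nan], 'y': [10.0, 20.0]})

total = left.add(right, fill_value=0)
print(total)
# Row 0 of 'x': only the right side is missing, so 1.0 + 0
# Row 1 of 'x': missing on both sides, so the result stays NaN
# Column 'y': absent from `left`, so it is treated as 0 + right
```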


In [147]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


In [149]:
df2.columns

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

**Operations between DataFrame and Series**

In [150]:
arr = np.arange(12.).reshape((3, 4))

In [151]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [152]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

This is called broadcasting: the first row is subtracted from every row of the array.
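
NumPy's broadcasting works along either axis, as long as the shapes are compatible. The sketch below centers the same array by column means and then by row means; the variable names are only illustrative.

```python
import numpy as np

arr = np.arange(12.).reshape((3, 4))

# A (4,) vector is stretched across the 3 rows
row_centered = arr - arr.mean(axis=0)
# A (3, 1) column is stretched across the 4 columns
col_centered = arr - arr.mean(axis=1, keepdims=True)

print(row_centered)
print(col_centered)
```

`keepdims=True` keeps the reduced axis with length 1, which is what lets the row means line up with the rows.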

In [153]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'), 
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [154]:
series = frame.iloc[0]

In [155]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [156]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [157]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [161]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [162]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [163]:
series3 = frame.iloc[:, 1]

In [164]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [165]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


##### Function Application and Mapping

**NumPy ufuncs (element-wise array methods) also work with pandas**

In [166]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                    index=['Utah', 'Ohio', 'Texas','Oregon'])

In [167]:
frame

Unnamed: 0,b,d,e
Utah,-1.05525,-0.379561,-1.267224
Ohio,-1.358436,1.118551,0.519981
Texas,-0.521828,-0.310115,-0.388143
Oregon,0.276433,-0.061211,-0.980376


In [168]:
np.abs(frame)

Another common operation is applying a function that works on one-dimensional arrays to each column or row, using the DataFrame's `apply` method.

In [169]:
#Function
f = lambda x : x.max() - x.min()

In [170]:
frame.apply(f)

b    1.634869
d    1.498111
e    1.787204
dtype: float64

Passing `axis='columns'` applies the function once per row instead:

In [171]:
frame.apply(f, axis='columns')

Utah      0.887663
Ohio      2.476987
Texas     0.211712
Oregon    1.256808
dtype: float64
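
The function passed to `apply` does not have to return a scalar. A sketch where it returns a Series with several values; the helper name `min_max` is made up, and the frame uses fixed numbers instead of random ones so the result is repeatable.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def min_max(x):
    # x is one column of the frame, passed in as a Series
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)
print(summary)
```

The returned Series labels (`min`, `max`) become the row index of the result, with one column per original column.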

In [172]:
frame

Unnamed: 0,b,d,e
Utah,-1.05525,-0.379561,-1.267224
Ohio,-1.358436,1.118551,0.519981
Texas,-0.521828,-0.310115,-0.388143
Oregon,0.276433,-0.061211,-0.980376


Element-wise functions can be applied with `applymap()`; pandas 2.1 and later offer the same behaviour under the name `DataFrame.map()`.

In [173]:
f2 = lambda x: '%2.f' % x

In [174]:
frame.applymap(f2)

Unnamed: 0,b,d,e
Utah,-1,0,-1
Ohio,-1,1,1
Texas,-1,0,0
Oregon,0,0,-1
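
Since the method name depends on the pandas release, here is a hedged sketch that picks whichever one exists. The helper `two_decimals` and the two-row frame are illustrative, and the format string `'%.2f'` (two decimal places) differs on purpose from the `'%2.f'` used above.

```python
import pandas as pd

frame = pd.DataFrame({'b': [-1.05525, 0.276433], 'd': [-0.379561, 1.118551]},
                     index=['Utah', 'Oregon'])

def two_decimals(x):
    return '%.2f' % x

# DataFrame.map exists from pandas 2.1; older releases only have applymap
if hasattr(pd.DataFrame, 'map'):
    formatted = frame.map(two_decimals)
else:
    formatted = frame.applymap(two_decimals)

print(formatted)
```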


### Sorting and Ranking

In [175]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [176]:
obj

d    0
a    1
b    2
c    3
dtype: int64

As you can see, the index of `obj` is not in order. To sort it lexicographically, call `sort_index()`, which returns a new, sorted object.

In [177]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

For a DataFrame, `sort_index()` sorts by the row labels by default:

In [178]:
df1 = pd.DataFrame(np.arange(8).reshape((2, 4)), 
                  index=['three', 'one'], 
                  columns=['d', 'a', 'b', 'c'])

In [179]:
df1.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


We can also sort the columns by passing **axis=1** to `sort_index()`.

In [180]:
df1.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


Data is sorted in ascending order by default; we can reverse the order by passing the argument **`ascending=False`**.

In [181]:
df1.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


We can also sort a Series by its values with the `sort_values()` method.

In [184]:
frame = pd.Series([2, -1, 3, 5.5, 12, -90])

In [185]:
frame.sort_values()

5   -90.0
1    -1.0
0     2.0
2     3.0
3     5.5
4    12.0
dtype: float64

In [186]:
frame = pd.DataFrame({'b': [4, 5, -3, 2], 'a':[0, 1, 0, 1]})

In [187]:
frame

Unnamed: 0,b,a
0,4,0
1,5,1
2,-3,0
3,2,1


In [188]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,5,1


In [189]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,5,1


In [190]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,5,1
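
When sorting by several columns, `ascending` can also be a list with one flag per column. A small sketch on the same data as above:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 5, -3, 2], 'a': [0, 1, 0, 1]})

# 'a' ascending, then 'b' descending within each group of equal 'a'
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```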


*Ranking* assigns ranks from one through the number of valid data points in an array. By default, tied values share the mean of the ranks they would otherwise occupy.

In [191]:
obj = pd.Series([7, -2, 2, 0, -12, -9, 7.9])

In [198]:
obj.rank()

0    6.0
1    3.0
2    5.0
3    4.0
4    1.0
5    2.0
6    7.0
dtype: float64

In [199]:
obj.rank(method='first')

0    6.0
1    3.0
2    5.0
3    4.0
4    1.0
5    2.0
6    7.0
dtype: float64

We can rank in descending order too.

In [200]:
obj.rank(ascending=False, method='max')

0    2.0
1    5.0
2    3.0
3    4.0
4    7.0
5    6.0
6    1.0
dtype: float64
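
The Series above has no ties, so every `method` gives the same ranks. A sketch with a made-up Series `scores` that does contain a tie shows how the tie-breaking methods differ:

```python
import pandas as pd

scores = pd.Series([10, 20, 10, 30])   # two entries tie at 10

ranks = pd.DataFrame({
    'average': scores.rank(),               # default: tied entries share the mean rank
    'min': scores.rank(method='min'),       # tied entries all get the lowest rank
    'first': scores.rank(method='first'),   # ties broken by order of appearance
    'dense': scores.rank(method='dense'),   # like 'min', but ranks rise by 1 between groups
})
print(ranks)
```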

In [201]:
frame

Unnamed: 0,b,a
0,4,0
1,5,1
2,-3,0
3,2,1


In [202]:
frame = pd.DataFrame({'b':[4.3, 7, -3, 2], 'a':[0, 1, 0, 1], 'c':[-2, 5, 8, -2.5]})

In [203]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [204]:
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


In [205]:
frame = frame.sort_index(axis='columns')

In [206]:
frame

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


In [207]:
frame.rank(axis='columns')

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0


#### Axis Indexes with Duplicate Labels

In [208]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [209]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [210]:
obj.index.is_unique

False

In [211]:
obj['a']

a    0
a    1
dtype: int64

In [212]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [213]:
df

Unnamed: 0,0,1,2
a,-0.869086,-0.19384,-1.545872
a,0.841958,0.009237,-0.637422
b,1.205296,-0.426974,-0.971985
b,-0.374197,3.264281,-1.389393


In [214]:
df.loc['b']

Unnamed: 0,0,1,2
b,1.205296,-0.426974,-0.971985
b,-0.374197,3.264281,-1.389393


###### Summarizing and Computing Descriptive Statistics

In [215]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
                  [np.nan, np.nan], [0.75, -1.2]],
                 index=['a', 'b', 'c', 'd'],
                 columns=['one', 'two'])

In [216]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.2


In [217]:
df.sum()

one    9.25
two   -5.70
dtype: float64

Calling `sum` on the DataFrame `df` returns a Series with the sum of each column, `one` and `two`; missing values are skipped. Passing `axis=1` sums across the columns of each row instead.

In [219]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.45
dtype: float64

In [220]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.225
dtype: float64

In [221]:
df.idxmax()

one    b
two    d
dtype: object

`idxmax()` returns, for each column, the index label at which the maximum value occurs.

In [222]:
df.idxmin()

one    d
two    b
dtype: object

In [223]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.7


In [224]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.85
std,3.493685,2.333452
min,0.75,-4.5
25%,1.075,-3.675
50%,1.4,-2.85
75%,4.25,-2.025
max,7.1,-1.2


In [225]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [226]:
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [228]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

The `describe` output above tells us that the Series holds 16 elements (`count = 16`), that 3 of them are unique (**a, b and c**), and that **a** is the most frequent value, appearing 8 times. That's why `top = a` and `freq = 8`.

### Correlation and Covariance

Correlation and covariance are computed from pairs of arguments.

In [230]:
import pandas_datareader.data as web

In [231]:
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [238]:
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})

In [239]:
price

Date,AAPL,IBM,MSFT,GOOG
2009-12-31,20.159719,100.460022,24.345514,307.986847
2010-01-04,20.473503,101.649574,24.720928,311.349976
2010-01-05,20.508902,100.421654,24.728914,309.978882
2010-01-06,20.182680,99.769310,24.577150,302.164703
2010-01-07,20.145369,99.423950,24.321552,295.130463
2010-01-08,20.279305,100.421654,24.489288,299.064880
2010-01-11,20.100410,99.370209,24.177786,298.612823
2010-01-12,19.871763,100.160728,24.018032,293.332153
2010-01-13,20.152065,99.945816,24.241684,291.648102
2010-01-14,20.035355,101.542137,24.728914,293.019196


In [240]:
volume = pd.DataFrame({ticker:data['Volume'] for ticker, data in all_data.items()})

In [241]:
volume

Date,AAPL,IBM,MSFT,GOOG
2009-12-31,88102700.0,4223400.0,31929700.0,2455400.0
2010-01-04,123432400.0,6155300.0,38409100.0,3937800.0
2010-01-05,150476200.0,6841400.0,49749600.0,6048500.0
2010-01-06,138040000.0,5605300.0,58182400.0,8009000.0
2010-01-07,119282800.0,5840600.0,50559700.0,12912000.0
2010-01-08,111902700.0,4197200.0,51197400.0,9509900.0
2010-01-11,115557400.0,5730400.0,68754700.0,14519600.0
2010-01-12,148614900.0,8081500.0,65912100.0,9769600.0
2010-01-13,151473000.0,6455400.0,51863500.0,13077600.0
2010-01-14,108223500.0,7111800.0,63228100.0,8535300.0


In [242]:
returns = price.pct_change()

In [243]:
returns.tail()

Date,AAPL,IBM,MSFT,GOOG
2018-12-28,0.000512,-0.006592,-0.007808,-0.006514
2018-12-31,0.009665,0.005662,0.011754,-0.001417
2019-01-02,0.001141,0.013548,-0.00443,0.009888
2019-01-03,-0.099607,-0.019964,-0.036788,-0.028484
2019-01-04,0.042689,0.039058,0.046509,0.053786


In [244]:
returns['MSFT'].corr(returns['IBM'])

0.49018506669639705

In [245]:
returns['MSFT'].cov(returns['IBM'])

8.764455658027708e-05
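
The cells above need a network connection and a working data source. The sketch below runs the same steps offline on synthetic data: the column names `X` and `Y`, the seed, and the random-walk construction are all assumptions for illustration, not real prices.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
dates = pd.date_range('2020-01-01', periods=50, freq='D')

# Hypothetical price series: two random walks that share a common component
common = rng.randn(50)
prices = pd.DataFrame({
    'X': 100 + np.cumsum(common + 0.5 * rng.randn(50)),
    'Y': 50 + np.cumsum(common + 0.5 * rng.randn(50)),
}, index=dates)

rets = prices.pct_change().dropna()   # the first row has no previous value

corr = rets['X'].corr(rets['Y'])
cov = rets['X'].cov(rets['Y'])

# Correlation is covariance scaled by the two standard deviations
print(corr, cov / (rets['X'].std() * rets['Y'].std()))
```

The two printed numbers should agree, which is a quick way to check how `corr` and `cov` relate.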

In [253]:
all_data['AAPL']['Adj Close']

Date
2009-12-31     20.159719
2010-01-04     20.473503
2010-01-05     20.508902
2010-01-06     20.182680
2010-01-07     20.145369
2010-01-08     20.279305
2010-01-11     20.100410
2010-01-12     19.871763
2010-01-13     20.152065
2010-01-14     20.035355
2010-01-15     19.700523
2010-01-19     20.572037
2010-01-20     20.255390
2010-01-21     19.905249
2010-01-22     18.917973
2010-01-25     19.426920
2010-01-26     19.701479
2010-01-27     19.887072
2010-01-28     19.065300
2010-01-29     18.373636
2010-02-01     18.629066
2010-02-02     18.737165
2010-02-03     19.059557
2010-02-04     18.372683
2010-02-05     18.698900
2010-02-08     18.570709
2010-02-09     18.768732
2010-02-10     18.666370
2010-02-11     19.005989
2010-02-12     19.169575
                 ...    
2018-11-20    176.979996
2018-11-21    176.779999
2018-11-23    172.289993
2018-11-26    174.619995
2018-11-27    174.240005
2018-11-28    180.940002
2018-11-29    179.550003
2018-11-30    178.580002
2018-12-03    184.82

In [261]:
l = []
for ticker, data in all_data.items():
    l.append(ticker)

In [262]:
l

['AAPL', 'IBM', 'MSFT', 'GOOG']

In [265]:
type(all_data['AAPL']['Adj Close'])

pandas.core.series.Series

What we are doing in the `price` cell is fairly simple: a dictionary comprehension picks each company's `Adj Close` column, which is a pandas Series, and `pd.DataFrame` combines those Series into one table with a column per ticker.

The same goes for the `volume` cell.

In [267]:
returns.corr()['AAPL']

AAPL    1.000000
IBM     0.374848
MSFT    0.446303
GOOG    0.453676
Name: AAPL, dtype: float64

In [268]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000271,7.6e-05,0.000107,0.000115
IBM,7.6e-05,0.000151,8.8e-05,7.8e-05
MSFT,0.000107,8.8e-05,0.000212,0.000121
GOOG,0.000115,7.8e-05,0.000121,0.000239


We can use the DataFrame's `corrwith` method to compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame.

In [269]:
returns.corrwith(returns.IBM)

AAPL    0.374848
IBM     1.000000
MSFT    0.490185
GOOG    0.411534
dtype: float64

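The result above is the correlation of each stock's percentage change with IBM's.

Passing a DataFrame instead computes the correlations of matching column names. Here that is the correlation of each stock's percentage change with its own volume.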

In [270]:
returns.corrwith(volume)

AAPL   -0.058709
IBM    -0.176611
MSFT   -0.089684
GOOG   -0.016911
dtype: float64

## Unique Values, Value Counts, and Membership

In [272]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [273]:
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [274]:
obj.describe()

count     9
unique    4
top       c
freq      3
dtype: object

In [275]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

The resulting Series is sorted by count in descending order as a convenience. `value_counts` is also available as a top-level pandas function that can be used with any array or sequence.

In [276]:
pd.value_counts(obj.values, sort=False)

b    2
a    3
d    1
c    3
dtype: int64

In [277]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [278]:
mask = obj.isin(['b', 'c'])

In [279]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [280]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame.

In [282]:
data = pd.DataFrame({'Qu1': [1, 2, 4, 6, 3], 
                   'Qu2': [2, 3, 1, 5, 6], 
                   'Qu3': [1, 2, 3, 5, 6]})

In [283]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,2,3,2
2,4,1,3
3,6,5,5
4,3,6,6


In [284]:
result = data.apply(pd.value_counts).fillna(0)

In [285]:
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,1.0,1.0,1.0
3,1.0,1.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,1.0
6,1.0,1.0,1.0
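
As far as I know, recent pandas releases emit a deprecation warning for the top-level `pd.value_counts`. A sketch of the same per-column histogram using the Series method instead, with the counts converted back to integers:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 2, 4, 6, 3],
                     'Qu2': [2, 3, 1, 5, 6],
                     'Qu3': [1, 2, 3, 5, 6]})

# Count values column by column using the Series method, then replace
# the NaN left by absent values with 0 and restore the integer dtype
counts = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(counts.sort_index())
```

The row labels of `counts` are the distinct values found across all columns, and each cell is how often that value occurs in the column.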


# THE END :) :)