# PANDAS

## The Series

The series is the pandas object designed to represent one-dimensional data
structures, similar to an array but with some additional features. Its internal structure is
simple: it is composed of two arrays associated with each other. The
main array holds the data (of any NumPy type), and each of its elements is associated
with a label contained in the other array, called the index.

### Declaring a Series

In [2]:
import numpy as np

In [3]:
import pandas as pd

In [3]:
labels =['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10, 'b':20, 'c':30}

In [5]:
s = pd.Series(data = my_data)
s

0    10
1    20
2    30
dtype: int64

In [7]:
s = pd.Series(data = my_data, index = labels, dtype = float)
s

a    10.0
b    20.0
c    30.0
dtype: float64

In [9]:
s = pd.Series(arr)
s

0    10
1    20
2    30
dtype: int32

In [10]:
s = pd.Series(d)
s  # the dictionary keys become the index

a    10
b    20
c    30
dtype: int64

In [11]:
# a series can hold any data type

In [12]:
s.values

array([10, 20, 30], dtype=int64)

In [13]:
s.index

Index(['a', 'b', 'c'], dtype='object')

### Selecting the Internal Elements

In [24]:
s = pd.Series([1,2,3,4],['USA','CHINA','JAPAN','INDIA'])
s

USA      1
CHINA    2
JAPAN    3
INDIA    4
dtype: int64

In [27]:
s['INDIA']  # the index labels are strings, so the label is quoted

4

In [28]:
s[['USA',"JAPAN"]]  # to select several labels, always pass them as a list: s[[...]]

USA      1
JAPAN    3
dtype: int64

In [29]:
s[2:4]

JAPAN    3
INDIA    4
dtype: int64

In [31]:
s = pd.Series([10,20,33,40,50,60])
s

0    10
1    20
2    33
3    40
4    50
5    60
dtype: int64

In [32]:
s[3]

40

In [33]:
s[0:4]

0    10
1    20
2    33
3    40
dtype: int64

###  Assigning Values to the Elements

In [36]:
s = pd.Series([10,20,33,40,50,60])
s

0    10
1    20
2    33
3    40
4    50
5    60
dtype: int64

In [35]:
s[3] = 44
s

0    10
1    20
2    33
3    44
4    50
5    60
dtype: int64

In [37]:
s[1:4]=[5,6,7]
s

0    10
1     5
2     6
3     7
4    50
5    60
dtype: int64

### Defining a Series from NumPy Arrays and Other Series

In [44]:
arr = np.array([1,2,3,4])
s3 = pd.Series(arr)
s3

0    1
1    2
2    3
3    4
dtype: int32

Always keep in mind that the values contained in the NumPy array or in the
original series are not necessarily copied; in the session shown here they are passed by
reference. That is, the new series object shares its data with the original object, so if an
element of the original changes in value, the change also appears in the new series.
This behavior depends on the pandas version: releases that enable Copy-on-Write may
copy a NumPy array passed to the constructor unless copy=False is specified.

In [45]:
arr[2]= -22.2

In [46]:
s3

0     1
1     2
2   -22
3     4
dtype: int32

However, note that the dtype does not change; it remains that of the original array. Because the array holds integers, assigning -22.2 stores -22: the fractional part is truncated, not rounded. If the original array had a float dtype, -22.2 would have been stored as is.

Note: The dtype that matters here is that of the array, not that of the series. If you instead convert the series to float while the array remains int, the conversion creates a copy of the data, so changing the element at index 2 of the array to -22.2 has no effect on the series.
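
Whether a series shares its data with the source array can be checked rather than assumed. The following is a minimal sketch, not part of the original session: it uses np.shares_memory and the copy parameter of the Series constructor, and the variable names are hypothetical. The result of the first check may differ between pandas versions, so no output is claimed for it.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# copy=False asks pandas to reuse the array's buffer when it can
s_shared = pd.Series(arr, copy=False)

# converting int -> float allocates a new buffer, detached from arr
s_float = s_shared.astype(float)

print(np.shares_memory(arr, s_shared.to_numpy()))
print(np.shares_memory(arr, s_float.to_numpy()))   # False

arr[2] = -22         # modify the source array in place
print(s_float[2])    # 3.0, the converted series is an independent copy
```

The converted series keeps its old value because astype() had to build new float data, which is the situation described in the note above.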

In [57]:
arr = np.array([1,2,3,4], dtype = float)
s3 = pd.Series(arr)
s3

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

In [58]:
arr[2]= -22.2

In [59]:
s3

0     1.0
1     2.0
2   -22.2
3     4.0
dtype: float64

### Filtering Values

Because pandas is built on NumPy, many operations that are applicable to NumPy
arrays extend to the series. One of these is filtering the values contained in the data
structure through conditions.
For example, if you need to know which elements in the series are greater than 22, you
write the following:

In [60]:
s = pd.Series([10,20,33,40,50,60])
s[s > 22]

2    33
3    40
4    50
5    60
dtype: int64
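
The condition inside the brackets is itself a Boolean series, which is why conditions can be stored and combined. A small sketch along these lines (the names mask and middle are illustrative only):

```python
import pandas as pd

s = pd.Series([10, 20, 33, 40, 50, 60])

mask = s > 22                  # Boolean series, one entry per element
print(mask.tolist())

# combine conditions with & (and) or | (or); each one needs parentheses
middle = s[(s > 22) & (s < 60)]
print(middle.tolist())         # [33, 40, 50]
```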

### Operations and Mathematical Functions

Other operations, such as the operators (+, -, *, and /) and the mathematical functions that are
applicable to NumPy arrays, can be extended to series.
For the operators, you can simply write the arithmetic expression.

In [63]:
round(s/7, 2)

0    1.43
1    2.86
2    4.71
3    5.71
4    7.14
5    8.57
dtype: float64

However, with the NumPy mathematical functions, you must specify the function
referenced with np and the instance of the series passed as an argument.

In [64]:
np.log(s)

0    2.302585
1    2.995732
2    3.496508
3    3.688879
4    3.912023
5    4.094345
dtype: float64

### Evaluating Values 

A series often contains duplicate values. You may then need more
information about the samples, including whether any duplicates exist and whether a
certain value is present in the series.
To illustrate, you can declare a series in which there are many duplicate values.
In [67]:
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green',
                                       'green','yellow'])

serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

To know all the values contained in the series, excluding duplicates, you can use
the unique() function. The return value is an array containing the unique values in the
series in order of first appearance, not sorted.

In [68]:
serd.unique()

array([1, 0, 2, 3], dtype=int64)

A function that’s similar to unique() is value_counts(), which not only returns the
unique values but also counts their occurrences within the series.

In [69]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

Finally, isin() evaluates membership: given a list of values, this function tells you
whether each element of the data structure is contained in that list. The Boolean values that are
returned can be very useful when filtering data in a series or in a column of a dataframe.

In [70]:
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

In [71]:
serd[serd.isin([0,3])]

white     0
yellow    3
dtype: int64
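
To check directly whether any duplicates exist, the duplicated() method can complement unique() and value_counts(). This is an illustrative sketch on the same data; the variable names are assumptions and not part of the original session.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# True for every value that already appeared earlier in the series
repeated = serd.duplicated()
print(repeated.tolist())

has_duplicates = bool(repeated.any())
print(has_duplicates)          # True

# presence test for a single value
print(bool((serd == 3).any()))
```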

### NaN values

Missing values are represented in pandas by NaN (Not a Number). Generally, NaN
values are a problem and must be managed in some way, especially during data analysis.
They are often generated when extracting data from a questionable source or when the
source is missing data. Furthermore, NaN values can also be generated in special cases,
such as calculating the logarithm of a negative value, or by exceptions during the
execution of some calculation or function.

Despite their problematic nature, however, pandas allows you to explicitly define
NaNs and add them to a data structure, such as a series. Within the array containing the
values, you enter np.nan wherever you want to define a missing value.

In [73]:
s2 = pd.Series([5,-3,np.nan,14])  # np.nan is the spelling accepted by all NumPy versions
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

The isnull() and notnull() functions are very useful to identify the indexes
without a value.

In [74]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [75]:
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool

In [76]:
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64
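
NaN values can also arise from a calculation, such as the logarithm of a negative value. The sketch below is an assumed example, not taken from the original session; np.errstate is used only to silence the NumPy warning that the negative input would otherwise trigger.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, -5, 4])

# the logarithm of a negative number is undefined, so NumPy returns NaN
with np.errstate(invalid='ignore'):
    logs = np.log(s)

print(logs)
print(logs.isnull().tolist())   # [False, True, False]
```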

### Series as Dictionaries

In [77]:
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500,'orange': 1000}
myseries = pd.Series(mydict)
myseries

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

### Operations Between Series

In [80]:
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500,'orange': 1000}
myseries1 = pd.Series(mydict)
print(myseries1)

mydict2 = {'red':400,'yellow':1000,'black':700}
myseries2 = pd.Series(mydict2)
print(myseries2)

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64
red        400
yellow    1000
black      700
dtype: int64


In [79]:
myseries1+myseries2

black        NaN
blue         NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64

You get a new series object in which only the items with the same label are added.
All other labels present in only one of the two series are still included in the result but
have a NaN value.
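
If NaN is not the desired result for unmatched labels, the add() method accepts a fill_value argument that stands in for the missing operand. A minimal sketch on the same two series (the name total is illustrative):

```python
import pandas as pd

myseries1 = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
myseries2 = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# a label missing from one series is treated as 0 instead of producing NaN
total = myseries1.add(myseries2, fill_value=0)
print(total)
```

With fill_value=0, blue and orange keep their values from the first series and black keeps its value from the second.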

## The DataFrame

The dataframe is a tabular data structure very similar to a spreadsheet. This data
structure is designed to extend series to multiple dimensions. In fact, the dataframe
consists of an ordered collection of columns (see Figure 4-2), each of which can contain
values of a different type (numeric, string, Boolean, etc.).

Unlike series, which have one index array containing the labels associated with each
element, the dataframe has two index arrays. The first index array, associated with
the rows, works much like the index array in a series: each label is
associated with all the values in its row. The second array contains a series of labels,
each associated with a particular column.
A dataframe may also be understood as a dict of series, where the keys are the
column names and the values are the series that form the columns of the dataframe.
Furthermore, all elements in each series are mapped according to a shared array of labels,
called the index.
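
The dict-of-series view can be made concrete with a small sketch. The two series below are made-up data, used only to show how the keys become columns and how the rows are aligned on the labels.

```python
import pandas as pd

price = pd.Series([1.2, 1.0], index=['ball', 'pen'])
stock = pd.Series([10, 3, 7], index=['ball', 'pen', 'mug'])

# keys become column names; rows are aligned on the union of the labels
frame = pd.DataFrame({'price': price, 'stock': stock})
print(frame)
print(frame.columns.tolist())
```

The label mug exists only in stock, so its price entry is NaN.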

### Defining a Dataframe

In [119]:
data = {'color' : ['blue','green','yellow','red','white'],
        'object' : ['ball','pen','pencil','paper','mug'],
        'price' : [1.2,1.0,0.6,0.9,1.7]}

frame =pd.DataFrame(data)
frame

    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    0.6
3     red   paper    0.9
4   white     mug    1.7


If the dict object from which you want to create a dataframe contains more data
than you are interested in, you can make a selection. In the constructor of the dataframe,
you can specify a sequence of columns using the columns option. The columns will be
created in the order of the sequence regardless of how they are contained in the dict
object.

In [83]:
frame = pd.DataFrame(data, columns = ['object','price'])
frame

   object  price
0    ball    1.2
1     pen    1.0
2  pencil    0.6
3   paper    0.9
4     mug    1.7


In [104]:
frame = pd.DataFrame(data, index=['one','two','three','four','five'])
frame

        color  object  price
one      blue    ball    1.2
two     green     pen    1.0
three  yellow  pencil    0.6
four      red   paper    0.9
five    white     mug    1.7


Now that we have introduced the two new options called index and columns, it is
easy to imagine an alternative way to define a dataframe. Instead of using a dict object,
you can define three arguments in the constructor, in the following order—a data matrix,
an array containing the labels assigned to the index option, and an array containing the
names of the columns assigned to the columns option.

In [86]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])

frame3

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


### Selecting Elements

In [87]:
frame

        color  object  price
one      blue    ball    1.2
two     green     pen    1.0
three  yellow  pencil    0.6
four      red   paper    0.9
five    white     mug    1.7


In [88]:
frame.columns

Index(['color', 'object', 'price'], dtype='object')

In [89]:
frame.index

Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

In [90]:
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6],
       ['red', 'paper', 0.9],
       ['white', 'mug', 1.7]], dtype=object)

In [91]:
frame['price']

one      1.2
two      1.0
three    0.6
four     0.9
five     1.7
Name: price, dtype: float64

In [94]:
frame.price

one      1.2
two      1.0
three    0.6
four     0.9
five     1.7
Name: price, dtype: float64

In [95]:
frame.loc['two']

color     green
object      pen
price         1
Name: two, dtype: object

In [105]:
frame.loc[['one','three','five']]

        color  object  price
one      blue    ball    1.2
three  yellow  pencil    0.6
five    white     mug    1.7


In [99]:
frame = pd.DataFrame(data)  # rebuild the dataframe with the default integer index
frame

    color  object  price
0    blue    ball    1.2
1   green     pen    1.0
2  yellow  pencil    0.6
3     red   paper    0.9
4   white     mug    1.7


In [100]:
frame.loc[2:4]

    color  object  price
2  yellow  pencil    0.6
3     red   paper    0.9
4   white     mug    1.7


In [101]:
frame.loc[[2,4]]   # to select several separate index labels, pass them as a list: [[...]]

    color  object  price
2  yellow  pencil    0.6
4   white     mug    1.7


In [102]:
frame['object'][4]

'mug'

In [None]:
# general pattern: frame['column name'][index_label]
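
A single element can also be selected in one step with the loc, iloc, and at indexers instead of chaining two selections. The sketch below reuses the data from this section; the result variable names are illustrative.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

by_label = frame.loc['five', 'object']   # row label, column label
by_position = frame.iloc[4, 1]           # row position, column position
scalar = frame.at['two', 'price']        # access to one scalar by labels
print(by_label, by_position, scalar)     # mug mug 1.0
```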

### Assigning Values

In [120]:
frame.index.name = 'id'
frame.columns.name = 'items'
frame

items   color  object  price
id
0        blue    ball    1.2
1       green     pen    1.0
2      yellow  pencil    0.6
3         red   paper    0.9
4       white     mug    1.7


In [121]:
# add  a new column

frame['new'] = [5,6,4,7,50]
frame

items   color  object  price  new
id
0        blue    ball    1.2    5
1       green     pen    1.0    6
2      yellow  pencil    0.6    4
3         red   paper    0.9    7
4       white     mug    1.7   50


In [122]:
frame['new'] = [3.0,1.3,2.2,0.8,1.1] # update
frame

items   color  object  price  new
id
0        blue    ball    1.2  3.0
1       green     pen    1.0  1.3
2      yellow  pencil    0.6  2.2
3         red   paper    0.9  0.8
4       white     mug    1.7  1.1


You can follow a similar approach if you want to update an entire column, for
example, by using the np.arange() function to update the values of a column with a
predetermined sequence.

In [123]:
ser = pd.Series(np.arange(5), dtype= float)
ser

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [124]:
frame['new'] = ser
frame

items   color  object  price  new
id
0        blue    ball    1.2  0.0
1       green     pen    1.0  1.0
2      yellow  pencil    0.6  2.0
3         red   paper    0.9  3.0
4       white     mug    1.7  4.0


In [125]:
frame['price'][2] = 3.3

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frame['price'][2] = 3.3
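
The message above is pandas warning about chained assignment: frame['price'] may return a copy, in which case the assignment would not reach the original dataframe. A sketch of the usual alternative, which sets the value with a single loc call on a freshly built dataframe:

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data)

# one indexing operation: row label 2, column 'price'
frame.loc[2, 'price'] = 3.3
print(frame)
```

Because the row and the column are given together, pandas modifies the dataframe itself and no warning is raised.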


In [126]:
frame

items   color  object  price  new
id
0        blue    ball    1.2  0.0
1       green     pen    1.0  1.0
2      yellow  pencil    3.3  2.0
3         red   paper    0.9  3.0
4       white     mug    1.7  4.0


### Membership of a Value

In [127]:
frame.isin([1.0,'pen'])

items  color  object  price    new
id
0      False   False  False  False
1      False    True   True   True
2      False   False  False  False
3      False   False  False  False
4      False   False  False  False


In [128]:
frame[frame.isin([1.0,'pen'])]

items  color  object  price  new
id
0        NaN     NaN    NaN  NaN
1        NaN     pen    1.0  1.0
2        NaN     NaN    NaN  NaN
3        NaN     NaN    NaN  NaN
4        NaN     NaN    NaN  NaN


### Deleting a Column

In [129]:
frame

items   color  object  price  new
id
0        blue    ball    1.2  0.0
1       green     pen    1.0  1.0
2      yellow  pencil    3.3  2.0
3         red   paper    0.9  3.0
4       white     mug    1.7  4.0


In [130]:
del frame['new']

In [136]:
frame

items   color  object  price
id
0        blue    ball    1.2
1       green     pen    1.0
2      yellow  pencil    3.3
3         red   paper    0.9
4       white     mug    1.7


### Filtering

Even with a dataframe, you can filter by applying conditions. For example, say you
want to get all rows whose price is smaller than a certain number.

In [143]:
frame[frame.price < 1.2]

items  color  object  price
id
1      green     pen    1.0
3        red   paper    0.9


### DataFrame from Nested dict

In [144]:
nestdict = { 'red': { 2012: 22, 2013: 33 },
            'white': { 2011: 13, 2012: 22, 2013: 16},
            'blue': {2011: 17, 2012: 27, 2013: 18}}
nestdict

{'red': {2012: 22, 2013: 33},
 'white': {2011: 13, 2012: 22, 2013: 16},
 'blue': {2011: 17, 2012: 27, 2013: 18}}

In [145]:
frame2 = pd.DataFrame(nestdict)
frame2

       red  white  blue
2012  22.0     22    27
2013  33.0     16    18
2011   NaN     13    17


### Transposition of a Dataframe

In [146]:
frame2.T

       2012  2013  2011
red    22.0  33.0   NaN
white  22.0  16.0  13.0
blue   27.0  18.0  17.0


## The Index Objects

In [147]:
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index

Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')

### Methods on Index

In [149]:
ser.idxmin()

'blue'

In [150]:
ser.idxmax()

'white'

### Index with Duplicate Labels

In [152]:
serd = pd.Series(range(6), index=['white','white','blue','green',
                                  'green','yellow'])
serd

white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [153]:
serd.index.is_unique

False

### Other Functionalities on Indexes

#### Reindexing

Once declared in a data structure, an Index object cannot be changed. However, by
reindexing you can overcome this limitation and obtain a new data structure built around
a different index.

In [154]:
ser = pd.Series([2,5,7,4], index=['one','two','three','four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In order to reindex this series, pandas provides you with the reindex() function.
This function creates a new series object with the values of the previous series
rearranged according to the new sequence of labels.

During reindexing, it is possible to change the order of the sequence of indexes,
delete some of them, or add new ones. In the case of a new label, pandas adds NaN as the
corresponding value.

In [156]:
ser.reindex(["three","four","five","one"])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64
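
The NaN inserted for a new label is also what turns the result into float64. If a default value makes more sense than NaN, reindex() accepts a fill_value argument, as in this brief sketch (the name filled is illustrative):

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# the new label 'five' receives 0 instead of NaN
filled = ser.reindex(['three', 'four', 'five', 'one'], fill_value=0)
print(filled)
```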

In [157]:
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

As you can see in this example, the index column is not a perfect sequence of
numbers; in fact there are some missing values (1, 2, and 4). A common need is
to fill in the gaps in order to obtain the complete sequence of numbers. To
achieve this, you use reindexing with the method option set to ffill. Moreover, you
need to set a range of values for the index. In this case, to specify the set of values from 0
to 6, you can use range(7) as an argument.

In [163]:
ser3.reindex(range(7),method='ffill')

0    1
1    1
2    1
3    5
4    5
5    6
6    3
dtype: int64

As you can see from the result, the indexes that were not present in the original series
were added. With ffill (forward fill), each new index takes the value of the closest
preceding index in the original series. In fact, the indexes 1 and 2 have the value 1, which
belongs to index 0. The bfill (backward fill) option used next does the opposite and takes
the value of the closest following index.

In [164]:
ser3.reindex(range(7),method='bfill')

0    1
1    5
2    5
3    5
4    6
5    6
6    3
dtype: int64
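
Note that ffill and bfill copy neighboring values; they do not compute intermediate ones. If estimated values are wanted instead, one possible approach (an assumption on our part, not something the original session uses) is to reindex without a method and then call interpolate(), which by default fills the gaps linearly, treating the rows as equally spaced.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# reindex first (the gaps become NaN), then estimate the gaps linearly
interpolated = ser3.reindex(range(7)).interpolate()
print(interpolated)
```

Here index 4 lies halfway between the values 5 and 6, so it receives 5.5.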

#### Dropping

In [165]:
ser = pd.Series(np.arange(4.), index=['red','blue','yellow','white'])
ser

red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [166]:
ser.drop('yellow')

red      0.0
blue     1.0
white    3.0
dtype: float64

In [167]:
ser.drop(['blue','white'])

red       0.0
yellow    2.0
dtype: float64

In [168]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


To delete rows, you just pass the indexes of the rows.

In [169]:
frame.drop(['blue','yellow'])

       ball  pen  pencil  paper
red       0    1       2      3
white    12   13      14     15


To delete columns, you likewise pass the labels of the columns, but you
must also specify the axis from which to delete the elements, using the
axis option. So to refer to the column names, you should specify axis = 1.

In [170]:
frame.drop(['pen','pencil'],axis=1)

        ball  paper
red        0      3
blue       4      7
yellow     8     11
white     12     15


#### Arithmetic and Data Alignment

In [171]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])

In [172]:
s1+s2

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

In [173]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])

frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                      index=['blue','green','white','yellow'],
                      columns=['mug','pen','ball'])

In [174]:
frame1

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [175]:
frame2

        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11


In [176]:
frame1+frame2

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN


### Operations Between Data Structures

#### Flexible Arithmetic Methods

In [177]:
frame1

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [178]:
frame2

        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11


In [179]:
frame1.add(frame2)

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN


In [183]:
frame1.sub(frame2)

        ball  mug  paper  pen  pencil
blue     2.0  NaN    NaN  4.0     NaN
green    NaN  NaN    NaN  NaN     NaN
red      NaN  NaN    NaN  NaN     NaN
white    4.0  NaN    NaN  6.0     NaN
yellow  -3.0  NaN    NaN -1.0     NaN


In [182]:
frame1.mul(frame2)

        ball  mug  paper   pen  pencil
blue     8.0  NaN    NaN   5.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   96.0  NaN    NaN  91.0     NaN
yellow  88.0  NaN    NaN  90.0     NaN


In [184]:
frame1.div(frame2)

            ball  mug  paper       pen  pencil
blue    2.000000  NaN    NaN  5.000000     NaN
green        NaN  NaN    NaN       NaN     NaN
red          NaN  NaN    NaN       NaN     NaN
white   1.500000  NaN    NaN  1.857143     NaN
yellow  0.727273  NaN    NaN  0.900000     NaN


## Operations Between DataFrame and Series

In [185]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])

frame

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [186]:
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
ser

ball      0
pen       1
pencil    2
paper     3
dtype: int32

In [187]:
frame + ser

        ball  pen  pencil  paper
red        0    2       4      6
blue       4    6       8     10
yellow     8   10      12     14
white     12   14      16     18


In [188]:
frame.sub(ser)

        ball  pen  pencil  paper
red        0    0       0      0
blue       4    4       4      4
yellow     8    8       8      8
white     12   12      12     12


In [189]:
ser['mug']= 9
ser

ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64

In [190]:
frame + ser

        ball  mug  paper  pen  pencil
red        0  NaN      6    2       4
blue       4  NaN     10    6       8
yellow     8  NaN     14   10      12
white     12  NaN     18   14      16


If a label is not present in one of the two data structures, the result contains a new
column for that label, with all of its elements set to NaN.
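
In the operations above, the labels of the series are matched against the column names of the dataframe. To match against the row labels instead, the flexible methods such as sub() take an axis argument. A minimal sketch under that assumption (the names by_row and shifted are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

by_row = frame['ball']         # a series indexed by red, blue, yellow, white

# axis=0 aligns the series with the row labels instead of the columns
shifted = frame.sub(by_row, axis=0)
print(shifted)
```

Each row has its own ball value subtracted, so every row of the result reads 0, 1, 2, 3.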

## Function Application and Mapping

### Functions by Element

The pandas library is built on the foundations of NumPy and extends many of its
features by adapting them to new data structures such as the series and the dataframe. Among these
are the universal functions, called ufuncs. This class of functions operates element by element on
the data structure.

In [191]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])

frame

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [192]:
np.sqrt(frame)

            ball       pen    pencil     paper
red     0.000000  1.000000  1.414214  1.732051
blue    2.000000  2.236068  2.449490  2.645751
yellow  2.828427  3.000000  3.162278  3.316625
white   3.464102  3.605551  3.741657  3.872983


### Functions by Row or Column

The application of functions is not limited to ufuncs; it also includes
functions defined by the user, applied with apply(). The function receives each column
(or, with axis=1, each row) as a one-dimensional series. It may reduce that series to a
single number, such as the range covered by its elements, or return a transformed series
of the same length, as the cube-root function below does.

In [200]:
def f(x):
    return x**(1/3)

frame.apply(f) # applied column by column: first ball, then pen, ...

            ball       pen    pencil     paper
red     0.000000  1.000000  1.259921  1.442250
blue    1.587401  1.709976  1.817121  1.912931
yellow  2.000000  2.080084  2.154435  2.223980
white   2.289428  2.351335  2.410142  2.466212


In [214]:
frame.ball.apply(f)

red       0.000000
blue      1.587401
yellow    2.000000
white     2.289428
Name: ball, dtype: float64

In [None]:
frame.red.apply(f)  # raises an AttributeError, since red is a row label and not a column

In [212]:
frame.T.red.apply(f)  # transposing turns the row labels into columns,
                      # so the function can be applied to a single original row

ball      0.000000
pen       1.000000
pencil    1.259921
paper     1.442250
Name: red, dtype: float64

In [216]:
frame.apply(f, axis =1) # applied row by row: first red, then blue, ...

            ball       pen    pencil     paper
red     0.000000  1.000000  1.259921  1.442250
blue    1.587401  1.709976  1.817121  1.912931
yellow  2.000000  2.080084  2.154435  2.223980
white   2.289428  2.351335  2.410142  2.466212
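
For comparison, here is a sketch of a reducing function of the kind mentioned at the start of this section: a lambda that calculates the range covered by the elements of each column or row. The names value_range, per_column, and per_row are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# reduces a one-dimensional series to a single number
value_range = lambda x: x.max() - x.min()

per_column = frame.apply(value_range)         # one number per column
per_row = frame.apply(value_range, axis=1)    # one number per row
print(per_column)
print(per_row)
```

Because the function returns a scalar, the result of apply() is a series rather than a dataframe: 12 for every column and 3 for every row.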


It is not mandatory that the method apply() return a scalar value. It can also
return a series. A useful case would be to extend the application to many functions
simultaneously. In this case, you get two or more values for each column or row to which
the function is applied.

In [217]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])

In [218]:
frame.apply(f)

     ball  pen  pencil  paper
min     0    1       2      3
max    12   13      14     15


In [219]:
frame.apply(f, axis = 1)

        min  max
red       0    3
blue      4    7
yellow    8   11
white    12   15


## Statistics Functions

Most of the statistical functions for arrays are still valid for the dataframe, so using the
apply() function is not necessary for them. For example, functions such as sum() and
mean() calculate the sum and the average, respectively, of the elements of each column
of a dataframe.

In [220]:
frame

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [221]:
frame.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [222]:
frame.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [223]:
frame.T.mean()

red        1.5
blue       5.5
yellow     9.5
white     13.5
dtype: float64
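
Transposing is one way to obtain statistics per row. The same result can be obtained without transposing by passing axis=1, as this short sketch suggests (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

row_means = frame.mean(axis=1)   # same values as frame.T.mean()
row_sums = frame.sum(axis=1)
print(row_means)
print(row_sums)
```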

In [224]:
frame.describe()

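            ball        pen     pencil      paper
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000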


## Sorting and Ranking

In [225]:
frame

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [227]:
frame.sort_index()

        ball  pen  pencil  paper
blue       4    5       6      7
red        0    1       2      3
white     12   13      14     15
yellow     8    9      10     11


In [237]:
frame.sort_index(axis = 1)

        ball  paper  pen  pencil
red        0      3    1       2
blue       4      7    5       6
yellow     8     11    9      10
white     12     15   13      14


In [230]:
frame.sort_index(ascending = False)

        ball  pen  pencil  paper
yellow     8    9      10     11
white     12   13      14     15
red        0    1       2      3
blue       4    5       6      7


In [240]:
frame.sort_values(by = ['pen','pencil'])

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


In [246]:
ser = pd.Series([5,0,3,8,4],
                index=['red','blue','yellow','white','green'])
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [247]:
ser.sort_values()

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

In [248]:
ser.rank()

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [249]:
ser.rank(method = 'first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [250]:
ser.rank(ascending= False)

red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64

## Correlation and Covariance

In [251]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])
frame2

        ball  pen  pencil  paper
red        1    4       3      6
blue       4    5       6      1
yellow     3    3       1      5
white      4    1       6      4


In [252]:
frame2.corr()

            ball       pen    pencil     paper
ball    1.000000 -0.276026  0.577350 -0.763763
pen    -0.276026  1.000000 -0.079682 -0.361403
pencil  0.577350 -0.079682  1.000000 -0.692935
paper  -0.763763 -0.361403 -0.692935  1.000000


In [254]:
frame2.cov()

            ball       pen    pencil     paper
ball    2.000000 -0.666667  2.000000 -2.333333
pen    -0.666667  2.916667 -0.333333 -1.333333
pencil  2.000000 -0.333333  6.000000 -3.666667
paper  -2.333333 -1.333333 -3.666667  4.666667


In [255]:
frame2

        ball  pen  pencil  paper
red        1    4       3      6
blue       4    5       6      1
yellow     3    3       1      5
white      4    1       6      4


In [256]:
ser = pd.Series([0,1,2,3,9],
                index=['red','blue','yellow','white','green'])

In [257]:
ser

red       0
blue      1
yellow    2
white     3
green     9
dtype: int64

In [259]:
frame2.corrwith(ser)

# Using the corrwith() method, you can calculate the pairwise correlations
# between the columns or rows of a dataframe and a series or another
# dataframe.

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

## “Not a Number” Data

In the previous sections, you saw how easily missing data can arise. It is
recognizable in the data structures by the NaN (Not a Number) value. Having values
that are not defined in a data structure is quite common in data analysis.
However, pandas is designed to manage this eventuality well. In this
section, you will learn how to treat these values so that many issues can be avoided.
For example, in the pandas library, calculating descriptive statistics excludes NaN values
implicitly.
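
The implicit exclusion of NaN values can be seen in a short sketch. The series is made-up data, and skipna is the parameter that controls this behavior.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, np.nan, 9])

print(ser.mean())               # 3.0, the NaN is skipped: 12 / 4
print(ser.count())              # 4, only the non-NaN elements are counted
print(ser.mean(skipna=False))   # nan, the NaN propagates when it is not skipped
```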

### Assigning a NaN Value

In [4]:
ser = pd.Series([0,1,2,np.nan,9],
                index=['red','blue','yellow','white','green'])

ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

In [5]:
ser['white']=None
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

### Filtering Out NaN Values

There are various ways to eliminate the NaN values during data analysis. Eliminating them
by hand, element by element, can be very tedious and risky, and you can never be sure that
you eliminated all of them. This is where the dropna() function comes to your aid.

In [7]:
ser.dropna()

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

You can also directly perform the filtering function by placing notnull() in the
selection condition.

In [8]:
ser[ser.notnull()]

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

If you’re dealing with a dataframe, it gets a little more complex. If you use the
dropna() function on this type of object with its default options, every row that contains
at least one NaN value is eliminated.

In [9]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                      index = ['blue','green','red'],
                      columns = ['ball','mug','pen'])

frame3

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0


In [10]:
frame3.dropna()

Empty DataFrame
Columns: [ball, mug, pen]
Index: []


Therefore, to avoid having entire rows and columns disappear completely, you
should specify the how option, assigning a value of all to it. This tells the dropna()
function to delete only the rows or columns in which all elements are NaN.

In [11]:
frame3.dropna(how = 'all')

      ball  mug  pen
blue   6.0  NaN  6.0
red    2.0  NaN  5.0
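
The how option can be combined with the axis option to work on columns rather than rows. The sketch below, built on the same data, removes the column that is entirely NaN and then chains both calls; the result names are illustrative.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# drop only the columns in which every element is NaN
no_empty_columns = frame3.dropna(how='all', axis=1)
print(no_empty_columns)

# drop the fully empty rows first, then the fully empty columns
compact = frame3.dropna(how='all').dropna(how='all', axis=1)
print(compact)
```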


### Filling in NaN Occurrences

Rather than filter NaN values within data structures, with the risk of discarding them
along with values that could be relevant in the context of data analysis, you can replace
them with other numbers. For most purposes, the fillna() function is a great choice.
This method takes one argument, the value with which to replace any NaN. It can be the
same for all cases.

In [12]:
frame3.fillna(0)

       ball  mug  pen
blue    6.0  0.0  6.0
green   0.0  0.0  0.0
red     2.0  0.0  5.0


Or you can replace NaN with different values depending on the column, specifying
the column names and the associated values one by one.

In [22]:
frame3.fillna({'ball':4.0,'mug':0,'pen':5.5})

       ball  mug  pen
blue    6.0  0.0  6.0
green   4.0  0.0  5.5
red     2.0  0.0  5.0
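
The per-column values do not have to be typed by hand. One possible approach, sketched here as an illustration rather than as the method used above, is to compute each column's mean and pass the resulting series to fillna(). For frame3 the means of ball and pen happen to be 4.0 and 5.5; the mug column has no valid values, so its mean is undefined and is replaced with 0 as an assumed fallback.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# Column means skip NaN; a column with no valid values has a NaN mean,
# which is replaced here with 0 (an assumed fallback).
means = frame3.mean().fillna(0)

# A series passed to fillna() is matched against the column names.
filled = frame3.fillna(means)
print(filled)
```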


## Hierarchical Indexing and Leveling

Hierarchical indexing is a very important feature of pandas, as it allows you to have
multiple levels of indexes on a single axis. It gives you a way to work with data in multiple
dimensions while continuing to work in a two-dimensional structure.
Let’s start with a simple example, creating a series containing two arrays of indexes,
that is, creating a structure with two levels.

In [24]:
mser = pd.Series(np.random.rand(8),
                 index=[['white','white','white','blue','blue','red','red','red'],
                        ['up','down','right','up','down','up','down','left']])
mser

white  up       0.377294
       down     0.555596
       right    0.885438
blue   up       0.652828
       down     0.801336
red    up       0.478715
       down     0.766311
       left     0.140618
dtype: float64

In [26]:
mser.index


MultiIndex([('white',    'up'),
            ('white',  'down'),
            ('white', 'right'),
            ( 'blue',    'up'),
            ( 'blue',  'down'),
            (  'red',    'up'),
            (  'red',  'down'),
            (  'red',  'left')],
           )

Hierarchical indexing simplifies the selection of subsets of values.
You can select the values for a given label of the first index in the
usual way:

In [27]:
mser['white']

up       0.377294
down     0.555596
right    0.885438
dtype: float64

Or you can select values for a given value of the second index, in the following
manner:

In [28]:
mser[:,'up']

white    0.377294
blue     0.652828
red      0.478715
dtype: float64

Intuitively, if you want to select a specific value, you specify both indexes.

In [29]:
mser['white','up']

0.37729383363061564
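
The same three selections can also be written with the explicit label-based accessors loc and xs(). This is a sketch under assumptions: the random values are replaced by fixed numbers so that the results are reproducible, and the index is sorted first, which is optional but lets pandas look up full keys more efficiently.

```python
import pandas as pd

# Fixed values stand in for np.random.rand(8) (an assumption for reproducibility).
mser = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                 index=[['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red'],
                        ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']])
mser = mser.sort_index()             # optional: a sorted MultiIndex is faster to search

outer = mser.loc['white']            # all entries under the first-level label 'white'
inner = mser.xs('up', level=1)       # cross-section on the second level
single = mser.loc[('white', 'up')]   # one scalar, addressed by the full key

print(outer)
print(inner)
print(single)
```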

Hierarchical indexing plays a critical role in reshaping data and in group-based
operations such as pivot tables. For example, the data can be rearranged into
a dataframe with a function called unstack(). This function converts a
series with a hierarchical index into a simple dataframe, in which the second level of
the index becomes the set of columns and missing combinations are filled with NaN.

In [33]:
frame = mser.unstack()
frame

           down      left     right        up
blue   0.801336       NaN       NaN  0.652828
red    0.766311  0.140618       NaN  0.478715
white  0.555596       NaN  0.885438  0.377294


If you want to perform the reverse operation, which is to convert a dataframe
to a series, you use the stack() function.

In [34]:
frame.stack()

blue   down     0.801336
       up       0.652828
red    down     0.766311
       left     0.140618
       up       0.478715
white  down     0.555596
       right    0.885438
       up       0.377294
dtype: float64
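
The round trip between the two shapes can be checked with a short sketch. The fixed values are an assumption that replaces the random data, and the trailing dropna() is added so that the result does not depend on whether the installed pandas version keeps or discards NaN entries when stacking.

```python
import pandas as pd

# Fixed values stand in for np.random.rand(8) (an assumption for reproducibility).
mser = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                 index=[['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red'],
                        ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']])

frame = mser.unstack()             # second level becomes the columns; gaps become NaN
by_color = mser.unstack(level=0)   # first level becomes the columns instead

# Stack back into a series and discard the NaN gaps introduced by unstack().
restored = frame.stack().dropna()

print(frame.shape, by_color.shape)
print(restored.sort_index().equals(mser.sort_index()))
```

The order of the entries changes along the way, which is why both series are sorted before the comparison.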

With a dataframe, it is possible to define a hierarchical index both for the rows and for
the columns. When you declare the dataframe, pass an array of arrays
to both the index and the columns options.

In [35]:
mframe = pd.DataFrame(np.random.randn(16).reshape(4,4),
                      index=[['white','white','red','red'], 
                             ['up','down','up','down']],
                      columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe

                 pen               paper
                   1         2         1         2
white up    0.692767  1.786107 -1.052570  0.250192
      down  0.399649 -0.570093  1.274584 -0.076955
red   up    1.250175 -0.620771 -0.846264 -0.033622
      down  0.658533  0.360975 -1.364607 -0.235822
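
When the levels are full combinations of a few labels, as they are here, the arrays of arrays can be replaced by pd.MultiIndex.from_product(). The sketch below is an alternative construction, not the one used above; the integer data and the level names passed through names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Build each hierarchical index from the product of its level labels.
rows = pd.MultiIndex.from_product([['white', 'red'], ['up', 'down']],
                                  names=['colors', 'status'])
cols = pd.MultiIndex.from_product([['pen', 'paper'], [1, 2]],
                                  names=['objects', 'id'])

# Integers replace the random values so that the cells are easy to follow.
mframe = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

pens = mframe['pen']                            # sub-frame holding columns 1 and 2
cell = mframe.loc[('white', 'up'), ('pen', 2)]  # one cell: full row key, full column key

print(pens)
print(cell)
```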


## Reordering and Sorting Levels

Occasionally, you might need to rearrange the order of the levels on an axis or sort the
data by the labels of a specific level.
The swaplevel() function accepts as arguments the names assigned to the two
levels that you want to interchange and returns a new object with the two levels
swapped, while leaving the data unmodified.

In [37]:
mframe.columns.names = ['objects','id']
mframe.index.names = ['colors','status']
mframe

objects             pen               paper
id                    1         2         1         2
colors status
white  up      0.692767  1.786107 -1.052570  0.250192
       down    0.399649 -0.570093  1.274584 -0.076955
red    up      1.250175 -0.620771 -0.846264 -0.033622
       down    0.658533  0.360975 -1.364607 -0.235822


In [38]:
mframe.swaplevel('colors','status')

objects             pen               paper
id                    1         2         1         2
status colors
up     white   0.692767  1.786107 -1.052570  0.250192
down   white   0.399649 -0.570093  1.274584 -0.076955
up     red     1.250175 -0.620771 -0.846264 -0.033622
down   red     0.658533  0.360975 -1.364607 -0.235822


The sort_index() function, instead, sorts the data by the labels of
a single level, which you specify through the level parameter.

In [39]:
mframe.sort_index(level='colors')

objects             pen               paper
id                    1         2         1         2
colors status
red    down    0.658533  0.360975 -1.364607 -0.235822
       up      1.250175 -0.620771 -0.846264 -0.033622
white  down    0.399649 -0.570093  1.274584 -0.076955
       up      0.692767  1.786107 -1.052570  0.250192
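
The two functions are often combined: swap the levels, then sort on the new outer level so that equal labels end up next to each other. The following sketch assumes integer data in place of the random values and builds the frame with from_product() only to stay self-contained.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['white', 'red'], ['up', 'down']],
                                  names=['colors', 'status'])
cols = pd.MultiIndex.from_product([['pen', 'paper'], [1, 2]],
                                  names=['objects', 'id'])
# Integers replace the random values (an assumption for reproducibility).
mframe = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

swapped = mframe.swaplevel('colors', 'status')            # rows keyed by (status, colors)
sorted_rows = swapped.sort_index(level='status')          # group equal status labels together
cols_swapped = mframe.swaplevel('objects', 'id', axis=1)  # same operation on the columns

print(sorted_rows)
print(cols_swapped.columns.names)
```

Both functions return new objects, so `mframe` itself keeps its original level order.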


## Summary Statistics by Level

Many descriptive and summary statistics computed on a dataframe or on a
series have a level option, with which you can determine the level at which the
statistic is computed.
For example, to compute a statistic at row level, you simply pass the level name to the
level option. Note that this option was deprecated in pandas 1.3 and removed in
pandas 2.0, so the cells in this section require an older release; in current versions
the same result is obtained by grouping with groupby(level=...).

In [40]:
mframe.sum(level='colors')

objects       pen               paper
id              1         2         1         2
colors
white    1.092417  1.216014  0.222014  0.173236
red      1.908708 -0.259795 -2.210872 -0.269444


If you want to create a statistic for a given level of the column, for example, the id,
you must specify the second axis as an argument through the axis option set to 1.

In [41]:
mframe.sum(level='id', axis=1)

id                    1         2
colors status
white  up     -0.359803  2.036298
       down    1.674233 -0.647048
red    up      0.403911 -0.654392
       down   -0.706074  0.125153
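
On pandas 2.0 and later, where sum() no longer accepts a level argument, the two statistics above can be computed with groupby(). This is a sketch of one equivalent formulation, with integer data assumed in place of the random values; transposing before grouping is just one way to group the columns.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['white', 'red'], ['up', 'down']],
                                  names=['colors', 'status'])
cols = pd.MultiIndex.from_product([['pen', 'paper'], [1, 2]],
                                  names=['objects', 'id'])
# Integers replace the random values (an assumption for reproducibility).
mframe = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Row-level statistic per color; sort=False keeps the original group order.
by_color = mframe.groupby(level='colors', sort=False).sum()

# Column-level statistic per id: transpose, group the former columns, transpose back.
by_id = mframe.T.groupby(level='id').sum().T

print(by_color)
print(by_id)
```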


## Conclusions

This chapter introduced the pandas library. You learned how to install it and saw a
general overview of its characteristics.
You learned about the two basic data structures, the series and the dataframe,
along with their operation and their main characteristics. In particular, you discovered
the importance of indexing within these structures and how best to perform operations
on them. Finally, you looked at the possibility of extending the complexity of these
structures by creating hierarchies of indexes, thus distributing the data contained in
them into different sublevels.
In the next chapter, you will learn how to capture data from external sources such as files
and, inversely, how to write the analysis results to them.