# Pandas

# Installing and Using Pandas
Installation of Pandas on your system requires NumPy to be installed, and if building the library from source, requires the appropriate tools to compile the C and Cython sources on which Pandas is built. Details on this installation can be found in the Pandas documentation.

Once Pandas is installed, you can import it and check the version:

In [1]:
import pandas
pandas.__version__

'1.0.5'

# Introducing Pandas Objects

At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let's introduce the three fundamental Pandas data structures: the Series, DataFrame, and Index.

We will start our code sessions with the standard NumPy and Pandas imports:

In [2]:
import numpy as np
import pandas as pd

# Series

A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector.

In [3]:
counts = pd.Series([111,222,333,444,555,666,777,888,999])
counts

0    111
1    222
2    333
3    444
4    555
5    666
6    777
7    888
8    999
dtype: int64

In [4]:
counts.values

array([111, 222, 333, 444, 555, 666, 777, 888, 999], dtype=int64)

In [5]:
counts.index

RangeIndex(start=0, stop=9, step=1)

In [6]:
Friends = pd.Series([111,222,333,444,555,666,777,888,999], 
    index=['Muskan','Nagma','Seema','Pavitra','Pooja','Sanjana','Heena','Smran','Arshiya'])

Friends

Muskan     111
Nagma      222
Seema      333
Pavitra    444
Pooja      555
Sanjana    666
Heena      777
Smran      888
Arshiya    999
dtype: int64

In [7]:
Friends['Heena']

777

In [8]:
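Friends.iloc[0]  # explicit positional access; plain Friends[0] relies on an integer fallback that newer pandas deprecates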

111

In [9]:
Friends.name = 'counts'
Friends.index.name = 'Group'
Friends

Group
Muskan     111
Nagma      222
Seema      333
Pavitra    444
Pooja      555
Sanjana    666
Heena      777
Smran      888
Arshiya    999
Name: counts, dtype: int64

In [10]:
np.log(Friends)

Group
Muskan     4.709530
Nagma      5.402677
Seema      5.808142
Pavitra    6.095825
Pooja      6.318968
Sanjana    6.501290
Heena      6.655440
Smran      6.788972
Arshiya    6.906755
Name: counts, dtype: float64

In [11]:
Friends[Friends>1000]

Series([], Name: counts, dtype: int64)

In [12]:
Friends_dict = {'Muskan':111, 'Nagma':222,'Seema':333,'Pavitra':444,'Pooja':555,'Sanjana':666,'Heena':777,'Smran':888,'Arshiya':999}
print(Friends_dict)
pd.Series(Friends_dict)

{'Muskan': 111, 'Nagma': 222, 'Seema': 333, 'Pavitra': 444, 'Pooja': 555, 'Sanjana': 666, 'Heena': 777, 'Smran': 888, 'Arshiya': 999}


Muskan     111
Nagma      222
Seema      333
Pavitra    444
Pooja      555
Sanjana    666
Heena      777
Smran      888
Arshiya    999
dtype: int64

# Series as generalized NumPy array

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities.
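
As a minimal sketch of what the explicit index allows, the example below uses a hypothetical Series `scores` (not part of the data above) whose labels are non-contiguous integers. It assumes only the standard `.loc` (label-based) and `.iloc` (position-based) accessors.

```python
import pandas as pd

# Hypothetical data: the index does not have to be 0, 1, 2, ...
scores = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Label-based lookup: 5 is a label here, not a position
by_label = scores.loc[5]

# Position-based lookup is still available through .iloc
by_position = scores.iloc[0]

# Arithmetic keeps the labels attached to the values
doubled = scores * 2

print(by_label)      # 0.5
print(by_position)   # 0.25
print(doubled)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.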



# Series as specialized dictionary
You can also think of a Pandas Series as a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.

The Series-as-dictionary analogy is made even clearer by constructing a Series object directly from a Python dictionary, as `pd.Series(Friends_dict)` did above.
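
The dictionary analogy can be sketched with a few dict-style operations. The Series `stock` and its labels are hypothetical and used only for illustration; the last step shows one thing a plain dictionary cannot do, namely slicing by label.

```python
import pandas as pd

# Hypothetical mapping used only for illustration
stock = pd.Series({'apples': 3, 'pears': 5, 'plums': 8})

# Dictionary-style membership test and key listing
has_pears = 'pears' in stock
labels = list(stock.keys())

# Dictionary-style assignment adds a new entry
stock['figs'] = 2

# Unlike a dict, a Series supports slicing by label (both end points included)
subset = stock.loc['pears':'figs']

print(has_pears)   # True
print(labels)      # ['apples', 'pears', 'plums']
print(subset)
```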

# DataFrame: bi-dimensional Series with two (or more) indices


A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

The DataFrame has both row and column index; it can be thought of as a dict of Series (all sharing the same index).
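
The "dict of Series" view can be sketched as follows. The names `height_cm`, `weight_kg`, and `people` are hypothetical; the point is only that each dictionary key becomes a column and the shared labels become the row index.

```python
import pandas as pd

# Two hypothetical Series that share the same labels
height_cm = pd.Series({'Ana': 162, 'Ben': 178, 'Cleo': 170})
weight_kg = pd.Series({'Ana': 55.0, 'Ben': 80.5, 'Cleo': 63.2})

# Each dict key becomes a column; the shared labels become the row index
people = pd.DataFrame({'height_cm': height_cm, 'weight_kg': weight_kg})

print(people.index.tolist())    # row labels
print(people.columns.tolist())  # column labels
print(people.dtypes)            # each column keeps its own dtype
```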

DataFrames can be created in several ways. One of them is from a dict of lists.

In [13]:
data = {"Places": ['United States','India','Brazil','Russis','France','Spain','United Kingdom'],
        "Corona Cases": [9490941,8314421,8314421,1693454,152763,1259366,1073882],
        "Recovery": [562763,7656478,5028216,1266931,120714,659259,371432],
        "Deaths": [236319,123662,160548,29217,38289,38289,47250]}
print(data)
data = pd.DataFrame(data)
data

{'Places': ['United States', 'India', 'Brazil', 'Russis', 'France', 'Spain', 'United Kingdom'], 'Corona Cases': [9490941, 8314421, 8314421, 1693454, 152763, 1259366, 1073882], 'Recovery': [562763, 7656478, 5028216, 1266931, 120714, 659259, 371432], 'Deaths': [236319, 123662, 160548, 29217, 38289, 38289, 47250]}


           Places  Corona Cases  Recovery  Deaths
0   United States       9490941    562763  236319
1           India       8314421   7656478  123662
2          Brazil       8314421   5028216  160548
3          Russis       1693454   1266931   29217
4          France        152763    120714   38289
5           Spain       1259366    659259   38289
6  United Kingdom       1073882    371432   47250


In [14]:
df = pd.DataFrame(data, columns=["Places", "Corona Cases" ,"Deaths","Recovery"])
df

           Places  Corona Cases  Deaths  Recovery
0   United States       9490941  236319    562763
1           India       8314421  123662   7656478
2          Brazil       8314421  160548   5028216
3          Russis       1693454   29217   1266931
4          France        152763   38289    120714
5           Spain       1259366   38289    659259
6  United Kingdom       1073882   47250    371432


In [15]:
df['Cases'] = df.Deaths / df.Recovery
df

           Places  Corona Cases  Deaths  Recovery     Cases
0   United States       9490941  236319    562763  0.419926
1           India       8314421  123662   7656478  0.016151
2          Brazil       8314421  160548   5028216  0.031929
3          Russis       1693454   29217   1266931  0.023061
4          France        152763   38289    120714  0.317188
5           Spain       1259366   38289    659259  0.058079
6  United Kingdom       1073882   47250    371432  0.127210


In [16]:
df['Serie_aligned'] = pd.Series(range(7), index=[0,1,2, 3, 4,5,6])
df

           Places  Corona Cases  Deaths  Recovery     Cases  Serie_aligned
0   United States       9490941  236319    562763  0.419926              0
1           India       8314421  123662   7656478  0.016151              1
2          Brazil       8314421  160548   5028216  0.031929              2
3          Russis       1693454   29217   1266931  0.023061              3
4          France        152763   38289    120714  0.317188              4
5           Spain       1259366   38289    659259  0.058079              5
6  United Kingdom       1073882   47250    371432  0.127210              6


In [17]:
df.to_dict()

{'Places': {0: 'United States',
  1: 'India',
  2: 'Brazil',
  3: 'Russis',
  4: 'France',
  5: 'Spain',
  6: 'United Kingdom'},
 'Corona Cases': {0: 9490941,
  1: 8314421,
  2: 8314421,
  3: 1693454,
  4: 152763,
  5: 1259366,
  6: 1073882},
 'Deaths': {0: 236319,
  1: 123662,
  2: 160548,
  3: 29217,
  4: 38289,
  5: 38289,
  6: 47250},
 'Recovery': {0: 562763,
  1: 7656478,
  2: 5028216,
  3: 1266931,
  4: 120714,
  5: 659259,
  6: 371432},
 'Cases': {0: 0.4199263277791895,
  1: 0.0161512904497342,
  2: 0.03192941592007981,
  3: 0.023061240114891815,
  4: 0.3171877329887171,
  5: 0.058078843064713566,
  6: 0.12721036421202267},
 'Serie_aligned': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}}

In [18]:
pd.DataFrame(df.to_dict())

           Places  Corona Cases  Deaths  Recovery     Cases  Serie_aligned
0   United States       9490941  236319    562763  0.419926              0
1           India       8314421  123662   7656478  0.016151              1
2          Brazil       8314421  160548   5028216  0.031929              2
3          Russis       1693454   29217   1266931  0.023061              3
4          France        152763   38289    120714  0.317188              4
5           Spain       1259366   38289    659259  0.058079              5
6  United Kingdom       1073882   47250    371432  0.127210              6


# DataFrame as specialized dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for `df['Deaths']` returns the Series object containing the death counts we saw earlier.
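
A minimal sketch of this dictionary-style access, using a small hypothetical frame `small` whose column names mirror the ones used above:

```python
import pandas as pd

# Small hypothetical frame; the column names mirror the ones used above
small = pd.DataFrame({'Places': ['India', 'Spain'],
                      'Deaths': [123662, 38289]})

# Indexing with a column name returns that column as a Series
deaths = small['Deaths']
print(type(deaths).__name__)   # Series

# Like a dict, membership tests and iteration go over the keys (column names)
print('Deaths' in small)       # True
print(list(small))             # ['Places', 'Deaths']
```

Note that `in` checks column names, not cell values, which matches the dictionary behaviour rather than the behaviour of a two-dimensional array.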

# From a list of dicts
Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data:

In [19]:
data = [{'a': i, 'b': 10 * i} for i in range(7)]
print(data)
pd.DataFrame(data)

[{'a': 0, 'b': 0}, {'a': 1, 'b': 10}, {'a': 2, 'b': 20}, {'a': 3, 'b': 30}, {'a': 4, 'b': 40}, {'a': 5, 'b': 50}, {'a': 6, 'b': 60}]


   a   b
0  0   0
1  1  10
2  2  20
3  3  30
4  4  40
5  5  50
6  6  60


In [20]:
pd.DataFrame([{'aa': 1, 'bb': 2}, {'bb': 3, 'cc': 6}])

    aa  bb   cc
0  1.0   2  NaN
1  NaN   3  6.0


In [21]:
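# randint without a size returns a single integer, which DataFrame broadcasts to every cell;
# pass size=(3, 2) to randint to get independent random values instead
pd.DataFrame(np.random.randint(2, 12),
             columns=['Human', 'Humanity'],
             index=['a', 'b', 'c'])

   Human  Humanity
a      8         8
b      8         8
c      8         8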


# The Pandas Index Object

We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences for the operations available on Index objects. As a simple example, let's construct an Index from a list of integers:

In [22]:
ind = pd.Index([22,33,44,55,66,77,88,99])
ind

Int64Index([22, 33, 44, 55, 66, 77, 88, 99], dtype='int64')

# Index as immutable array
The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

In [23]:
ind[5]

77

In [24]:
ind[:]

Int64Index([22, 33, 44, 55, 66, 77, 88, 99], dtype='int64')

In [25]:
ind[1:]

Int64Index([33, 44, 55, 66, 77, 88, 99], dtype='int64')

In [26]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

8 (8,) 1 int64


In [28]:
ind[1] = 1

TypeError: Index does not support mutable operations
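
The ordered-set view of the Index mentioned earlier can be sketched with the set-style methods `intersection`, `union`, and `symmetric_difference`. The two indices `indA` and `indB` are hypothetical examples, not objects defined above.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # labels present in both
either = indA.union(indB)                    # labels present in at least one
only_one = indA.symmetric_difference(indB)   # labels present in exactly one

print(common.tolist())    # [3, 5, 7]
print(either.tolist())    # [1, 2, 3, 5, 7, 9, 11]
print(only_one.tolist())  # [1, 2, 9, 11]
```

The method forms are used here because the operator shorthands (`&`, `|`, `^`) no longer perform set operations on Index objects in recent pandas releases.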

# Operating on Data in Pandas

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and the ufuncs are key to this.

Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc. This means that keeping the context of data and combining data from different sources (both easy to get wrong with raw NumPy arrays) become much more reliable with Pandas. We will additionally see the well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures.

# Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. Let's start by defining a simple Series and DataFrame on which to demonstrate this:

In [29]:
rng = np.random.RandomState(20)
ser = pd.Series(rng.randint(0, 20, 2))
ser

0     3
1    15
dtype: int32

In [30]:
dfr = pd.DataFrame(rng.randint(0, 11, (6, 4)),
                  columns=['Aero', 'Cs', 'ME', 'CI'])
dfr

   Aero  Cs  ME  CI
0    10   9   4   6
1     7   2   0   6
2     8   5  10  10
3     3   0   6   6
4     0   9   5   7
5     5   2   6  10


In [31]:
np.exp(ser)

0    2.008554e+01
1    3.269017e+06
dtype: float64

In [32]:
np.sin(dfr * np.pi / 4)

            Aero         Cs            ME         CI
0            1.0   0.707107  1.224647e-16       -1.0
1     -0.7071068        1.0           0.0       -1.0
2  -2.449294e-16  -0.707107           1.0        1.0
3      0.7071068        0.0          -1.0       -1.0
4            0.0   0.707107    -0.7071068  -0.707107
5     -0.7071068        1.0          -1.0        1.0


# Universal Functions: Index Alignment
For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

# Index alignment in Series
As an example, suppose we are combining two different data sources, and find only the top three localities by area and the top three localities by population:

In [33]:
area = pd.Series({'sirsi': 1723337, 'Khazi Galli': 695662,
                  'Kasturbanagar': 423967}, name='area')
population = pd.Series({'Khazi Galli': 38332521, 'Kasturbanagar': 26448193,
                        'Mangalore': 19651127}, name='population')
print(area)
population


sirsi            1723337
Khazi Galli       695662
Kasturbanagar     423967
Name: area, dtype: int64


Khazi Galli      38332521
Kasturbanagar    26448193
Mangalore        19651127
Name: population, dtype: int64

In [34]:
population / area

Kasturbanagar    62.382669
Khazi Galli      55.102221
Mangalore              NaN
sirsi                  NaN
dtype: float64

In [35]:
area.index.union(population.index)  # set union of the two indices; the `|` shorthand is not supported for this in newer pandas

Index(['Kasturbanagar', 'Khazi Galli', 'Mangalore', 'sirsi'], dtype='object')

In [36]:
A = pd.Series([4, 8, 6], index=[0, 1, 2])
B = pd.Series([2, 4, 8], index=[1, 2, 3])
print(A)
print(B)
A + B

0    4
1    8
2    6
dtype: int64
1    2
2    4
3    8
dtype: int64


0     NaN
1    10.0
2    10.0
3     NaN
dtype: float64

In [37]:
A.add(B, fill_value=0)

0     4.0
1    10.0
2    10.0
3     8.0
dtype: float64
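
The same alignment happens on both axes when the operands are DataFrames. The sketch below assumes two small hypothetical frames, `left` and `right`, with partly overlapping rows and columns.

```python
import pandas as pd

# Two hypothetical frames with partly overlapping rows and columns
left = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
right = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]],
                     columns=['B', 'A', 'C'])

# Rows and columns are aligned by label; cells missing from either frame become NaN
total = left + right
print(total)

# fill_value stands in for a cell that is missing from one frame only
filled = left.add(right, fill_value=0)
print(filled)
```

Column order in the inputs does not matter: `left['A']` is added to `right['A']` even though the two frames list their columns differently.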

# Ufuncs: Operations Between DataFrame and Series
When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array. One common operation is taking the difference between a two-dimensional array and one of its rows; by default, Pandas applies such an operation row-wise, matching the Series index against the DataFrame columns.
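
As a minimal sketch of a DataFrame-Series operation, assuming a small hypothetical frame `frame` built with `np.arange` (not part of the data above), the following subtracts the first row from every row, then subtracts a column by naming the axis, and finally shows label alignment with a partial row:

```python
import numpy as np
import pandas as pd

# Small deterministic frame so the result is easy to check by hand
frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('QRST'))

# Subtracting a row (a Series indexed by column name) is applied row-wise
row_diff = frame - frame.iloc[0]
print(row_diff)

# To operate column-wise, use the method form and name the axis
col_diff = frame.subtract(frame['R'], axis=0)
print(col_diff)

# Labels are aligned: columns absent from the Series become NaN
half_row = frame.iloc[0, ::2]   # only columns Q and S
partial = frame - half_row
print(partial)
```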

# Data wrangling
Getting the data into the shape we want is often the most time-consuming task in a data scientist's work, and sometimes the most frustrating.

# Merge operations
By merging we mean combining different data sets by linking rows with one or more keys. The basic syntax is very simple.

In [38]:
df

           Places  Corona Cases  Deaths  Recovery     Cases  Serie_aligned
0   United States       9490941  236319    562763  0.419926              0
1           India       8314421  123662   7656478  0.016151              1
2          Brazil       8314421  160548   5028216  0.031929              2
3          Russis       1693454   29217   1266931  0.023061              3
4          France        152763   38289    120714  0.317188              4
5           Spain       1259366   38289    659259  0.058079              5
6  United Kingdom       1073882   47250    371432  0.127210              6


In [39]:
df2 = pd.DataFrame({"Places": ["United States", "India", "Brazil"], "Population": [1000000, 2000000, 3000000]})
df2

          Places  Population
0  United States     1000000
1          India     2000000
2         Brazil     3000000


In [40]:
df.merge(df2)  # with no `on=` given, merge joins on the column names both frames share (here 'Places'); the default is an inner join

          Places  Corona Cases  Deaths  Recovery     Cases  Serie_aligned  Population
0  United States       9490941  236319    562763  0.419926              0     1000000
1          India       8314421  123662   7656478  0.016151              1     2000000
2         Brazil       8314421  160548   5028216  0.031929              2     3000000


In [41]:
df3 = pd.DataFrame({"Places": ["United States", "India"], "Population": [1000000, 2000000]})
df.merge(df3, on='Places')  # `on` names the key explicitly; equivalent to left_on='Places', right_on='Places'

          Places  Corona Cases  Deaths  Recovery     Cases  Serie_aligned  Population
0  United States       9490941  236319    562763  0.419926              0     1000000
1          India       8314421  123662   7656478  0.016151              1     2000000


In [42]:
df4 = pd.DataFrame({"Places": ["United States", "India"], "Population": ["100000", "200000"]})
# how='outer' keeps every row from both frames; rows of df with no match in df4 get NaN for Population
df.merge(df4, how='outer')
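
To make the effect of the `how` argument concrete, here is a small sketch on two hypothetical frames, `left` and `right`, that share a `key` column. It also uses `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames sharing a 'key' column
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Default how='inner': only keys present in both frames
inner = left.merge(right, on='key')

# how='left': every row of `left`, with NaN where `right` has no match
left_join = left.merge(right, on='key', how='left')

# how='outer': union of keys; _merge shows left_only / right_only / both
outer = left.merge(right, on='key', how='outer', indicator=True)

print(inner)
print(left_join)
print(outer)
```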

# Combining data with overlap
Sometimes some data is missing, and it can be "patched" with another dataset. Let's take a look.

In [43]:
serie_a = pd.Series([np.nan, 10.5, np.nan, 20.5, 30.5, np.nan],
                     index=['Blue', 'White', 'Black', 'Blue', 'Red', 'Yellow'])
serie_b = pd.Series(np.arange(len(serie_a), dtype=np.float64),
                 index=['Blue', 'White', 'Black', 'Blue', 'Red', 'Yellow'])

In [44]:
serie_a

Blue       NaN
White     10.5
Black      NaN
Blue      20.5
Red       30.5
Yellow     NaN
dtype: float64

In [45]:
serie_b

Blue      0.0
White     1.0
Black     2.0
Blue      3.0
Red       4.0
Yellow    5.0
dtype: float64

In [46]:
pd.Series(np.where(pd.isnull(serie_a), serie_b, serie_a), index=serie_a.index)

Blue       0.0
White     10.5
Black      2.0
Blue      20.5
Red       30.5
Yellow     5.0
dtype: float64

In [47]:
serie_a.combine_first(serie_b)

Blue       0.0
White     10.5
Black      2.0
Blue      20.5
Red       30.5
Yellow     5.0
dtype: float64
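
The two approaches above give the same result because both Series share the same index. As a hedged sketch of where they differ, the hypothetical Series `primary` and `backup` below have only partly overlapping labels: `combine_first` aligns on labels and returns the union of both indices, which an element-by-element `np.where` on the raw values would not do.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: `primary` has a gap, `backup` covers partly different labels
primary = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
backup = pd.Series([20.0, 30.0, 40.0], index=['b', 'c', 'd'])

# Values from `primary` win; its gaps are filled from `backup`,
# and the result covers the union of both indices
patched = primary.combine_first(backup)
print(patched)
```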