## 2.4 THE DATAFRAME

The DataFrame is the most common Pandas object. It can be thought of as Python’s way of storing spreadsheet-like data. Many of the features of the Series data structure carry over into the DataFrame.

### 2.4.1 Boolean Subsetting: DataFrames

Just as we were able to subset a Series with a Boolean vector, we can subset a DataFrame with a Boolean vector.

In [2]:
import pandas as pd

In [3]:
scientists = pd.read_csv('data/scientists.csv')

In [4]:
ages = scientists['Age']

print(ages)

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64


In [5]:
# boolean vectors will subset rows
print(scientists[scientists['Age'] > scientists['Age'].mean()])

                   Name        Born        Died  Age     Occupation
1        William Gosset  1876-06-13  1937-10-16   61   Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90          Nurse
3           Marie Curie  1867-11-07  1934-07-04   66        Chemist
7          Johann Gauss  1777-04-30  1855-02-23   77  Mathematician
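
The same pattern can be reproduced without the CSV file. The sketch below rebuilds only the Name and Age columns from the printed output (a hand-typed stand-in for data/scientists.csv; the names `df`, `mask`, and `older` are illustrative) and stores the Boolean vector in its own variable before using it.

```python
import pandas as pd

# Hand-typed stand-in for two columns of data/scientists.csv
df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset', 'Florence Nightingale',
             'Marie Curie', 'Rachel Carson', 'John Snow',
             'Alan Turing', 'Johann Gauss'],
    'Age': [37, 61, 90, 66, 56, 45, 41, 77],
})

# One True/False value per row: is this age above the mean age?
mask = df['Age'] > df['Age'].mean()

# Rows where the mask is True are kept
older = df[mask]
print(older)
```

Naming the mask makes it easy to inspect (for example with `print(mask)`) before it is used to subset rows.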


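In the pandas version used for this output, if we supply a Boolean vector whose length differs from the number of rows in the dataframe, the number of rows returned is at most the length of the Boolean vector. Newer pandas releases are stricter and raise an error when the lengths do not match, so in practice the Boolean vector should have one value per row.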

In [10]:
# 4 values passed as a bool vector
# 3 rows returned

print(scientists.loc[[True, True, False, True]])

                Name        Born        Died  Age    Occupation
0  Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist
1     William Gosset  1876-06-13  1937-10-16   61  Statistician
3        Marie Curie  1867-11-07  1934-07-04   66       Chemist
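
Some pandas versions raise an error rather than return a partial result when the Boolean vector and the dataframe differ in length. A defensive sketch, assuming a small stand-in dataframe and a hypothetical helper `pad_mask` (not part of pandas), pads a short vector with False so that it has exactly one value per row:

```python
import pandas as pd

# Hand-typed stand-in for two columns of data/scientists.csv
df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset', 'Florence Nightingale',
             'Marie Curie', 'Rachel Carson', 'John Snow',
             'Alan Turing', 'Johann Gauss'],
    'Age': [37, 61, 90, 66, 56, 45, 41, 77],
})


def pad_mask(mask, n_rows):
    """Hypothetical helper: pad a short Boolean list with False up to n_rows."""
    if len(mask) > n_rows:
        raise ValueError('mask is longer than the dataframe')
    return list(mask) + [False] * (n_rows - len(mask))


short_mask = [True, True, False, True]
full_mask = pad_mask(short_mask, len(df))

# The padded mask has one value per row, so the selection is unambiguous
subset = df.loc[full_mask]
print(subset)
```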


In [7]:
print(scientists['Age'].mean())

59.125


In [8]:
print(scientists)

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


Table 2.3 Table of DataFrame Subsetting Methods

Syntax                               Selection Result

df[column_name]                      Single column
df[[column1, column2, ... ]]         Multiple columns
df.loc[row_label]                    Row by row index label (row name)
df.loc[[label1, label2, ...]]        Multiple rows by index label
df.iloc[row_number]                  Row by row number
df.iloc[[row1, row2, ...]]           Multiple rows by row number
df.ix[label_or_number]               Row by index label or number
df.ix[[lab_num1, lab_num2, ...]]     Multiple rows by index label or number
df[bool]                             Row based on bool
df[[bool1, bool2, ...]]              Multiple rows based on bool
df[start:stop:step]                  Rows based on slicing notation

Note: .ix was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.
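
The entries in Table 2.3 can be tried on a small dataframe. The sketch below uses a hand-made dataframe with text row labels (an assumption made here so that label-based and number-based selection are easy to tell apart). The `.ix` rows are left out because `.ix` is not available in current pandas releases.

```python
import pandas as pd

# Illustrative data; the row labels are made up for this example
df = pd.DataFrame(
    {'Age': [37, 61, 90, 66],
     'Occupation': ['Chemist', 'Statistician', 'Nurse', 'Chemist']},
    index=['Franklin', 'Gosset', 'Nightingale', 'Curie'],
)

single_column = df['Age']                        # one column (a Series)
multiple_columns = df[['Age', 'Occupation']]     # several columns (a DataFrame)

row_by_label = df.loc['Gosset']                  # one row, by index label
rows_by_label = df.loc[['Franklin', 'Curie']]    # several rows, by index label

row_by_number = df.iloc[1]                       # one row, by position
rows_by_number = df.iloc[[0, 3]]                 # several rows, by position

rows_by_bool = df[[True, False, False, True]]    # rows where the value is True
rows_by_slice = df[0:4:2]                        # every second row of the first four

print(rows_by_slice)
```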

### 2.4.2 Operations Are Automatically Aligned and Vectorized (Broadcasting)

Pandas supports broadcasting, which comes from the numpy library [6]. In essence, broadcasting describes what happens when operations are performed between array-like objects, which the Series and DataFrame are. The behavior depends on the type of each object, its length, and any labels associated with it.

   [6] numpy library: http://www.numpy.org/
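
A minimal sketch of how type, length, and labels each affect the result, using a small hand-typed Series (the values are the first four ages from the scientists data; all variable names are illustrative):

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66])

# Scalar: the operation is applied to every element
plus_100 = ages + 100

# Same length and same labels: element-by-element
doubled = ages + ages

# Same labels in a different order: values are matched by label, not position
reversed_ages = ages.sort_index(ascending=False)
aligned = ages + reversed_ages

# Different lengths: labels missing from one side produce NaN
shorter = pd.Series([1, 100])
partial = ages + shorter

print(partial)
```

Because matching is done on the index labels, `aligned` holds the same values as `doubled` even though `reversed_ages` is stored in the opposite order.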

First, let’s split our dataframe into two subsets.

In [11]:
first_half = scientists[:4]
second_half = scientists[4:]

print(first_half)

                   Name        Born        Died  Age    Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist
1        William Gosset  1876-06-13  1937-10-16   61  Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse
3           Marie Curie  1867-11-07  1934-07-04   66       Chemist


In [12]:
print(second_half)

            Name        Born        Died  Age          Occupation
4  Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5      John Snow  1813-03-15  1858-06-16   45           Physician
6    Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7   Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


When we perform an operation between a dataframe and a scalar, pandas tries to apply the operation to each cell of the dataframe. In this example, numbers are multiplied by 2 and strings are doubled (this is Python’s normal behavior with strings).

In [13]:
# multiply by a scalar

print(scientists * 2)

                                       Name                  Born  \
0        Rosaline FranklinRosaline Franklin  1920-07-251920-07-25   
1              William GossetWilliam Gosset  1876-06-131876-06-13   
2  Florence NightingaleFlorence Nightingale  1820-05-121820-05-12   
3                    Marie CurieMarie Curie  1867-11-071867-11-07   
4                Rachel CarsonRachel Carson  1907-05-271907-05-27   
5                        John SnowJohn Snow  1813-03-151813-03-15   
6                    Alan TuringAlan Turing  1912-06-231912-06-23   
7                  Johann GaussJohann Gauss  1777-04-301777-04-30   

                   Died  Age                            Occupation  
0  1958-04-161958-04-16   74                        ChemistChemist  
1  1937-10-161937-10-16  122              StatisticianStatistician  
2  1910-08-131910-08-13  180                            NurseNurse  
3  1934-07-041934-07-04  132                        ChemistChemist  
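4  1964-04-141964-04-14  112                    BiologistBiologist  
5  1858-06-161858-06-16   90                    PhysicianPhysician  
6  1954-06-071954-06-07   82  Computer ScientistComputer Scientist  
7  1855-02-231855-02-23  154            MathematicianMathematician  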
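
If doubling the text columns is not wanted, one option is to restrict the operation to the numeric columns first. A sketch, assuming a two-row stand-in for the scientists data and using `select_dtypes` to pick the numeric columns:

```python
import pandas as pd

# Two-row stand-in for the scientists data
df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Age': [37, 61],
})

# Keep only the numeric columns, then multiply by the scalar
numeric_doubled = df.select_dtypes(include='number') * 2
print(numeric_doubled)
```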

If your dataframes contain only numeric values and you want to “add” them on a cell-by-cell basis, you can use the add method. The automatic alignment is easier to see in Chapter 4, where we concatenate dataframes together.
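
A minimal sketch of the add method on two small numeric dataframes (the data and the names `left` and `right` are made up for illustration). The row labels only partly overlap, which shows the alignment at work:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
right = pd.DataFrame({'a': [100, 200], 'b': [1000, 2000]}, index=[1, 2])

# Cell-by-cell addition; rows are matched on the index label,
# so row 0 (absent from `right`) becomes NaN
total = left.add(right)

# fill_value stands in for a cell that is missing on one side
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```

Without `fill_value`, any cell that exists in only one dataframe is missing in the result; with `fill_value=0`, the existing value is carried through unchanged.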