# Working with Pandas DataFrames

In this notebook we will look at Pandas *DataFrames*. A DataFrame is a 2-dimensional labelled data structure with columns of data that can be of different types. Like a Pandas Series, it supports both position-based and label-based (index-based) data access.

To start off, we import the Pandas package. We can import it as *pd* for shorthand.

In [1]:
import pandas as pd

## Creating Pandas DataFrames

The easiest way to manually create a DataFrame is to pass the *DataFrame()* constructor a dictionary of lists, where each list becomes a column and each key becomes the column name. Notice that Jupyter Notebooks will render frames in a tabular format.

In [2]:
# the data we will use to populate our frame
countries = ["Argentina", "Australia", "Brazil", "Canada"]
regions = ["South America", "Oceania", "South America", "North America"]
pops = [43.59, 23.99, 200.4, 35.99]
life_exp = [75.77, 82.09, 73.12, 80.99]
# create the dictionary of lists
d1 = {"Country":countries, "Region":regions, "Population":pops, "Life Exp":life_exp}
# create the DataFrame
df1 = pd.DataFrame(d1)

In [3]:
# display the DataFrame's contents
df1

,Country,Region,Population,Life Exp
0,Argentina,South America,43.59,75.77
1,Australia,Oceania,23.99,82.09
2,Brazil,South America,200.4,73.12
3,Canada,North America,35.99,80.99


A DataFrame's associated *shape* attribute gives the number of rows and columns it has, as a (rows, columns) tuple:

In [4]:
df1.shape

(4, 4)
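
As a complement to the dictionary-of-lists approach above, a frame can also be built row by row from a list of dictionaries. The sketch below is illustrative only: the name `df_rows` is our own, and the values are the same four countries used in `df1`.

```python
import pandas as pd

# one dictionary per row; the keys become the column names
rows = [
    {"Country": "Argentina", "Region": "South America", "Population": 43.59, "Life Exp": 75.77},
    {"Country": "Australia", "Region": "Oceania", "Population": 23.99, "Life Exp": 82.09},
    {"Country": "Brazil", "Region": "South America", "Population": 200.4, "Life Exp": 73.12},
    {"Country": "Canada", "Region": "North America", "Population": 35.99, "Life Exp": 80.99},
]
df_rows = pd.DataFrame(rows)

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df_rows.shape
print(n_rows, n_cols)  # 4 4
```

The row-oriented form can be convenient when records arrive one at a time, while the column-oriented form is usually more compact when each column is already a list.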

If we do not provide data for an index, Pandas automatically creates numeric index labels for the rows, starting at 0. Alternatively, we could pass an explicit list of values to use as an index (e.g. country names in this case).

In [5]:
# create the dictionary of lists
d2 = {"Region":regions, "Population":pops, "Life Exp":life_exp}
# create the DataFrame, specifying a set of index labels too
df2 = pd.DataFrame(d2, index=countries)
df2

,Region,Population,Life Exp
Argentina,South America,43.59,75.77
Australia,Oceania,23.99,82.09
Brazil,South America,200.4,73.12
Canada,North America,35.99,80.99
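
If the labels we want to use already sit in a column, another option is to promote that column to the index with *set_index()*. This is a minimal sketch on a small frame of our own (`df` and `df_indexed` are illustrative names, not part of the notebook's data):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Brazil", "Canada"],
    "Region": ["South America", "Oceania", "South America", "North America"],
    "Population": [43.59, 23.99, 200.4, 35.99],
})

# set_index() returns a new frame; "Country" moves from the columns to the index
df_indexed = df.set_index("Country")

print(df_indexed.index.tolist())    # ['Argentina', 'Australia', 'Brazil', 'Canada']
print(df_indexed.columns.tolist())  # ['Region', 'Population']
```

Note that the original `df` is left unchanged, so it still has its default numeric index and a "Country" column.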


Like a Series, a DataFrame has an associated *index* attribute, which allows us to access the index values alone:

In [6]:
df2.index

Index(['Argentina', 'Australia', 'Brazil', 'Canada'], dtype='object')

We can use the *in* operator to check whether or not a particular index label exists in the DataFrame:

In [7]:
"Australia" in df2.index

True

In [8]:
"Ireland" in df2.index

False

The *columns* attribute of a DataFrame returns an Index object containing its column names in order (excluding the row index):

In [9]:
df2.columns

Index(['Region', 'Population', 'Life Exp'], dtype='object')

We can use the *in* operator to check whether or not a particular column exists in a DataFrame:

In [10]:
"Population" in df2.columns

True

In [11]:
"GDP" in df2.columns

False
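
One detail worth keeping in mind: applying *in* directly to the DataFrame, rather than to its *index* or *columns* attribute, checks the column labels only. The short sketch below illustrates this on a two-row frame of our own:

```python
import pandas as pd

df = pd.DataFrame(
    {"Region": ["South America", "Oceania"], "Population": [43.59, 23.99]},
    index=["Argentina", "Australia"],
)

# "in" on the frame itself looks at column labels
print("Population" in df)       # True
print("Australia" in df)        # False, row labels are not searched

# to test for a row label, use the index explicitly
print("Australia" in df.index)  # True
```

Writing `df.index` or `df.columns` explicitly, as in the cells above, avoids any ambiguity about which set of labels is being tested.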

## Loading Pandas DataFrames

Rather than creating frames by hand, we will usually load data from a file. The *read_csv()* function reads the contents of a CSV file into a DataFrame:

In [12]:
df3 = pd.read_csv("world_data.csv")
# check the size of the dataset which we have loaded (rows, columns)
df3.shape

(21, 6)

We can display the first *n* rows of the DataFrame by calling its associated *head()* method:

In [13]:
# display first 5 rows
df3.head(5)

,Country,Region,Population,Life Exp,Landlocked,Language
0,Argentina,South America,43.59,75.77,No,Spanish
1,Australia,Oceania,23.99,82.09,No,English
2,Brazil,South America,200.4,73.12,No,Portuguese
3,Canada,North America,35.99,80.99,No,English
4,Chad,Africa,11.63,49.81,Yes,Arabic


We can also tell the read_csv() function to use one of the columns in the CSV file as the index for the rows in our data. 

In [14]:
df4 = pd.read_csv("world_data.csv", index_col="Country")
# check the size of the dataset which we have loaded
df4.shape

(21, 5)
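
The effect of *index_col* on the shape can be reproduced without the data file. The sketch below reads a three-row CSV from an in-memory string instead of from `world_data.csv`; the string `csv_text` and the frame names are our own, for illustration only.

```python
import io
import pandas as pd

# a tiny stand-in for a CSV file on disk
csv_text = """Country,Region,Population,Life Exp
Argentina,South America,43.59,75.77
Australia,Oceania,23.99,82.09
Brazil,South America,200.4,73.12
"""

# default: a numeric index is created and Country stays a regular column
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.shape)     # (3, 4)

# index_col: Country becomes the row index, so there is one column fewer
df_by_country = pd.read_csv(io.StringIO(csv_text), index_col="Country")
print(df_by_country.shape)  # (3, 3)
```

This is why `df4` above has 5 columns while `df3` has 6: the "Country" column has moved into the index.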

In [15]:
# display first 10 rows
df4.head(10)

Country,Region,Population,Life Exp,Landlocked,Language
Argentina,South America,43.59,75.77,No,Spanish
Australia,Oceania,23.99,82.09,No,English
Brazil,South America,200.4,73.12,No,Portuguese
Canada,North America,35.99,80.99,No,English
Chad,Africa,11.63,49.81,Yes,Arabic
China,Asia,1357.0,74.87,No,Chinese
Egypt,Africa,90.37,70.48,No,Arabic
Germany,Europe,81.46,80.24,No,German
Ireland,Europe,4.64,80.15,No,English
Japan,Asia,126.26,84.36,No,Japanese


In [16]:
# check the column names
df4.columns

Index(['Region', 'Population', 'Life Exp', 'Landlocked', 'Language'], dtype='object')

In [17]:
# check the row index labels
df4.index

Index(['Argentina', 'Australia', 'Brazil', 'Canada', 'Chad', 'China', 'Egypt',
       'Germany', 'Ireland', 'Japan', 'Mexico', 'New Zealand', 'Niger',
       'Nigeria', 'Paraguay', 'Portugal', 'South Korea', 'Spain',
       'Switzerland', 'United Kingdom', 'United States'],
      dtype='object', name='Country')

## Accessing Columns in DataFrames

Pandas provides a number of different ways in which to access the elements in a DataFrame (columns, rows, or individual values).

A column in a DataFrame can be accessed by its column label, which returns that single column as a Series. Note that the original row index labels are retained:

In [18]:
df4["Population"]

Country
Argentina           43.59
Australia           23.99
Brazil             200.40
Canada              35.99
Chad                11.63
China             1357.00
Egypt               90.37
Germany             81.46
Ireland              4.64
Japan              126.26
Mexico             127.58
New Zealand          4.66
Niger               18.05
Nigeria            186.99
Paraguay             6.78
Portugal            10.29
South Korea         51.71
Spain               47.13
Switzerland          8.12
United Kingdom      65.10
United States      321.07
Name: Population, dtype: float64

In [19]:
df4["Region"]

Country
Argentina         South America
Australia               Oceania
Brazil            South America
Canada            North America
Chad                     Africa
China                      Asia
Egypt                    Africa
Germany                  Europe
Ireland                  Europe
Japan                      Asia
Mexico            North America
New Zealand             Oceania
Niger                    Africa
Nigeria                  Africa
Paraguay          South America
Portugal                 Europe
South Korea                Asia
Spain                    Europe
Switzerland              Europe
United Kingdom           Europe
United States     North America
Name: Region, dtype: object

We can easily select multiple columns by passing a list of column labels. Note that when multiple columns are selected the returned object is a DataFrame rather than a Series.

In [20]:
df4[["Region", "Population", "Language"]]

Country,Region,Population,Language
Argentina,South America,43.59,Spanish
Australia,Oceania,23.99,English
Brazil,South America,200.4,Portuguese
Canada,North America,35.99,English
Chad,Africa,11.63,Arabic
China,Asia,1357.0,Chinese
Egypt,Africa,90.37,Arabic
Germany,Europe,81.46,German
Ireland,Europe,4.64,English
Japan,Asia,126.26,Japanese
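
The difference between a single label and a list of labels holds even when the list contains just one column. A minimal sketch, using a small frame of our own rather than `df4`:

```python
import pandas as pd

df = pd.DataFrame(
    {"Region": ["Europe", "Asia", "Africa"], "Population": [4.64, 126.26, 11.63]},
    index=["Ireland", "Japan", "Chad"],
)

# a single label returns a Series
one_col_series = df["Population"]
print(type(one_col_series).__name__)  # Series

# a list of labels returns a DataFrame, even with one entry
one_col_frame = df[["Population"]]
print(type(one_col_frame).__name__)   # DataFrame
print(one_col_frame.shape)            # (3, 1)
```

The double brackets are therefore not special syntax: the outer pair is the indexing operation and the inner pair is an ordinary Python list.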


We can also use numeric positions to access individual columns, using *iloc[]* with a row selector followed by a column selector (a bare colon selects all rows):

In [21]:
# return all rows and the 2nd column
df4.iloc[:,1]

Country
Argentina           43.59
Australia           23.99
Brazil             200.40
Canada              35.99
Chad                11.63
China             1357.00
Egypt               90.37
Germany             81.46
Ireland              4.64
Japan              126.26
Mexico             127.58
New Zealand          4.66
Niger               18.05
Nigeria            186.99
Paraguay             6.78
Portugal            10.29
South Korea         51.71
Spain               47.13
Switzerland          8.12
United Kingdom      65.10
United States      321.07
Name: Population, dtype: float64

We can also use the *iloc[]* operator to perform slicing, allowing us to select multiple columns based on their position. The result will be a DataFrame with the specified columns:

In [22]:
# return all rows and first two columns
df4.iloc[:,0:2]

Country,Region,Population
Argentina,South America,43.59
Australia,Oceania,23.99
Brazil,South America,200.4
Canada,North America,35.99
Chad,Africa,11.63
China,Asia,1357.0
Egypt,Africa,90.37
Germany,Europe,81.46
Ireland,Europe,4.64
Japan,Asia,126.26
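
The same selection can be written with labels using *loc[]*. One difference to be aware of is that position slices in *iloc[]* exclude the end position, whereas label slices in *loc[]* include the end label. The sketch below compares the two on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["South America", "Oceania"],
        "Population": [43.59, 23.99],
        "Life Exp": [75.77, 82.09],
    },
    index=["Argentina", "Australia"],
)

# positions 0 and 1 (position 2 is excluded)
by_position = df.iloc[:, 0:2]

# labels from "Region" up to and including "Population"
by_label = df.loc[:, "Region":"Population"]

print(by_position.columns.tolist())  # ['Region', 'Population']
print(by_label.columns.tolist())     # ['Region', 'Population']
```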


## Accessing Rows in DataFrames

We can access rows of a DataFrame in several different ways.

We can access a single row of a DataFrame by using the *loc[]* operator with the index label of the row. This returns a Series:

In [23]:
df4.loc["Spain"]

Region         Europe
Population      47.13
Life Exp        83.49
Landlocked         No
Language      Spanish
Name: Spain, dtype: object

In [24]:
df4.loc["Ireland"]

Region         Europe
Population       4.64
Life Exp        80.15
Landlocked         No
Language      English
Name: Ireland, dtype: object

Notice that the row is returned as a Series with the column names as index labels. We can access individual elements using these. The Series also retains its original row index label, in the attribute *name*.

In [25]:
row = df4.loc["Ireland"]
print("%s is in %s and has a population of %.2f million" % (row.name, row["Region"], row["Population"]))

Ireland is in Europe and has a population of 4.64 million


Alternatively, we can access a single row by numeric position using *iloc[]*, counting from zero:

In [26]:
# return the 1st row
df4.iloc[0]

Region        South America
Population            43.59
Life Exp              75.77
Landlocked               No
Language            Spanish
Name: Argentina, dtype: object

In [27]:
# return the last row
df4.iloc[-1]

Region        North America
Population           321.07
Life Exp              78.51
Landlocked               No
Language            English
Name: United States, dtype: object

In [28]:
# return the 7th row
row = df4.iloc[6]
# print some values for this row
print("%s is in %s and has a population of %.2f million" % (row.name, row["Region"], row["Population"]))

Egypt is in Africa and has a population of 90.37 million


Both *loc[]* and *iloc[]* can also be used to access multiple rows at once. The result is a DataFrame in either case:

In [29]:
# use a list to specify multiple index labels
df4.loc[["Ireland", "Australia", "Portugal", "Spain"]]

Country,Region,Population,Life Exp,Landlocked,Language
Ireland,Europe,4.64,80.15,No,English
Australia,Oceania,23.99,82.09,No,English
Portugal,Europe,10.29,80.68,No,Portuguese
Spain,Europe,47.13,83.49,No,Spanish


In [30]:
# use slicing to specify multiple row positions
df4.iloc[0:4]

Country,Region,Population,Life Exp,Landlocked,Language
Argentina,South America,43.59,75.77,No,Spanish
Australia,Oceania,23.99,82.09,No,English
Brazil,South America,200.4,73.12,No,Portuguese
Canada,North America,35.99,80.99,No,English


In [31]:
# use a list to specify multiple row positions
positions = [5, 1, 3, 15, 11]
df4.iloc[positions]

Country,Region,Population,Life Exp,Landlocked,Language
China,Asia,1357.0,74.87,No,Chinese
Australia,Oceania,23.99,82.09,No,English
Canada,North America,35.99,80.99,No,English
Portugal,Europe,10.29,80.68,No,Portuguese
New Zealand,Oceania,4.66,80.67,No,English
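
Asking *loc[]* for a label that is not in the index raises a `KeyError`. A minimal sketch of one way to guard against this is shown below; the label "Atlantis" is a made-up example of a missing row, and the frame is a small stand-in for `df4`.

```python
import pandas as pd

df = pd.DataFrame(
    {"Region": ["Europe", "Europe", "Oceania"], "Population": [4.64, 47.13, 23.99]},
    index=["Ireland", "Spain", "Australia"],
)

# a label that does not exist raises KeyError
try:
    df.loc["Atlantis"]
    found = True
except KeyError:
    found = False
print(found)  # False

# keep only the requested labels that are actually present
wanted = ["Ireland", "Atlantis", "Spain"]
present = [label for label in wanted if label in df.index]
subset = df.loc[present]
print(subset.index.tolist())  # ['Ireland', 'Spain']
```

Filtering the list first keeps the requested order, just as the list passed to *loc[]* above determined the order of the returned rows.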


## Accessing Individual Values

We can extend the row/column access approaches above to access individual elements in a DataFrame:

In [32]:
# access individual value by column index, then row index
df4["Life Exp"]["Portugal"]

80.68

In [33]:
# access individual value by row index, then column index
df4.loc["Spain"]["Population"]

47.13

In [34]:
# access individual value by row position, then column index
df4.iloc[10]["Region"]

'North America'

In [35]:
# access individual value by row position, then column position
# (a single iloc[] call; plain integer keys on a labelled Series are deprecated)
df4.iloc[10, 2]

75.05

Alternatively, we can use the *at[]* operator if we need to get a single value in a DataFrame. We pass in a row label and a column label, separated by a comma:

In [36]:
df4.at["Ireland", "Region"]

'Europe'

In [37]:
df4.at["Germany", "Language"]

'German'

We can also use this *at[]* operator to modify individual values in a DataFrame:

In [38]:
# set the value
df4.at["Ireland", "Population"] = 4.71
# check the value
df4.at["Ireland", "Population"]

4.71
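
For completeness, *at[]* has a position-based counterpart, *iat[]*, and *loc[]* also accepts a row/column pair in a single call. The sketch below shows both on a small illustrative frame (not `df4`), assuming we only need one value at a time:

```python
import pandas as pd

df = pd.DataFrame(
    {"Region": ["Europe", "Europe"], "Population": [4.64, 47.13]},
    index=["Ireland", "Spain"],
)

# label-based access to one cell, in a single indexing call
print(df.loc["Spain", "Population"])  # 47.13

# position-based access to the same cell: row 1, column 1
print(df.iat[1, 1])                   # 47.13

# assignment through a single call updates the frame itself
df.loc["Ireland", "Population"] = 4.71
print(df.at["Ireland", "Population"]) # 4.71
```

When modifying values, a single call such as `df.at[row, col] = ...` or `df.loc[row, col] = ...` is generally safer than chaining two lookups, because a chained assignment may end up changing a temporary copy instead of the original frame.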

## Applying Conditions to DataFrames

We might want to filter the values in a DataFrame, to reduce it to a subset of the original values based on some condition applied to one or more columns in the frame. We can do this by indexing with a boolean expression.

In [39]:
# check which rows satisfy this condition 
df4["Life Exp"] > 80

Country
Argentina         False
Australia          True
Brazil            False
Canada             True
Chad              False
China             False
Egypt             False
Germany            True
Ireland            True
Japan              True
Mexico            False
New Zealand        True
Niger             False
Nigeria           False
Paraguay          False
Portugal           True
South Korea        True
Spain              True
Switzerland        True
United Kingdom     True
United States     False
Name: Life Exp, dtype: bool

In [40]:
# get subset of rows which satisfy this condition 
df4[df4["Life Exp"] > 80]

Country,Region,Population,Life Exp,Landlocked,Language
Australia,Oceania,23.99,82.09,No,English
Canada,North America,35.99,80.99,No,English
Germany,Europe,81.46,80.24,No,German
Ireland,Europe,4.71,80.15,No,English
Japan,Asia,126.26,84.36,No,Japanese
New Zealand,Oceania,4.66,80.67,No,English
Portugal,Europe,10.29,80.68,No,Portuguese
South Korea,Asia,51.71,83.23,No,Korean
Spain,Europe,47.13,83.49,No,Spanish
Switzerland,Europe,8.12,82.5,Yes,German


We can combine several different conditions using a boolean operator like AND (&) or OR (|). Note that each condition must be enclosed in parentheses, because & and | bind more tightly than comparison operators such as > and == in Python:

In [41]:
# check which rows satisfy both conditions
(df4["Life Exp"] > 80) & (df4["Region"] == "Europe")

Country
Argentina         False
Australia         False
Brazil            False
Canada            False
Chad              False
China             False
Egypt             False
Germany            True
Ireland            True
Japan             False
Mexico            False
New Zealand       False
Niger             False
Nigeria           False
Paraguay          False
Portugal           True
South Korea       False
Spain              True
Switzerland        True
United Kingdom     True
United States     False
dtype: bool

In [42]:
# get subset of rows which satisfy both conditions
df4[(df4["Life Exp"] > 80) & (df4["Region"] == "Europe")]

Country,Region,Population,Life Exp,Landlocked,Language
Germany,Europe,81.46,80.24,No,German
Ireland,Europe,4.71,80.15,No,English
Portugal,Europe,10.29,80.68,No,Portuguese
Spain,Europe,47.13,83.49,No,Spanish
Switzerland,Europe,8.12,82.5,Yes,German
United Kingdom,Europe,65.1,80.09,No,English


In [43]:
# get subset of rows which satisfy either condition
df4[(df4["Life Exp"] < 50) | (df4["Life Exp"] > 83)]

Country,Region,Population,Life Exp,Landlocked,Language
Chad,Africa,11.63,49.81,Yes,Arabic
Japan,Asia,126.26,84.36,No,Japanese
South Korea,Asia,51.71,83.23,No,Korean
Spain,Europe,47.13,83.49,No,Spanish
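
A few other ways of building boolean conditions are often handy: *isin()* tests membership in a list of values, *between()* tests an inclusive range, and the ~ operator negates a condition. The sketch below applies them to a five-row frame of our own that borrows values from the data above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["Africa", "Asia", "Europe", "South America", "Europe"],
        "Life Exp": [49.81, 84.36, 80.15, 73.12, 83.49],
    },
    index=["Chad", "Japan", "Ireland", "Brazil", "Spain"],
)

# rows whose region is any of the listed values
in_regions = df[df["Region"].isin(["Europe", "Asia"])]
print(in_regions.index.tolist())   # ['Japan', 'Ireland', 'Spain']

# rows that do NOT satisfy a condition
not_europe = df[~(df["Region"] == "Europe")]
print(not_europe.index.tolist())   # ['Chad', 'Japan', 'Brazil']

# rows with a value inside a range (both endpoints included by default)
mid_range = df[df["Life Exp"].between(70, 81)]
print(mid_range.index.tolist())    # ['Ireland', 'Brazil']
```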


## DataFrame Statistics

We can use the *describe()* method to get a basic summary of the numeric values in a frame, which is returned as a new DataFrame with statistics for each column:

In [44]:
df4.describe()

,Population,Life Exp
count,21.0,21.0
mean,134.422857,75.215238
std,291.620326,10.356273
min,4.66,49.81
25%,11.63,74.87
50%,47.13,80.09
75%,126.26,80.99
max,1357.0,84.36


We can also get individual statistics for each column containing numeric values. These are returned as a new Series, with one element for each numeric column in the original frame:

In [45]:
# get the mean values for columns 
df4.mean(numeric_only=True)

Population    134.422857
Life Exp       75.215238
dtype: float64

In [46]:
# get the median values for columns 
df4.median(numeric_only=True)

Population    47.13
Life Exp      80.09
dtype: float64

In [47]:
# get the ranges for the numeric values (i.e. minimum and maximum)
df4.min(numeric_only=True), df4.max(numeric_only=True)

(Population     4.66
 Life Exp      49.81
 dtype: float64,
 Population    1357.00
 Life Exp        84.36
 dtype: float64)

In [48]:
# get the standard deviations
df4.std(numeric_only=True)

Population    291.620326
Life Exp       10.356273
dtype: float64

## Sorting DataFrames

To sort a DataFrame, we call its associated *sort_values()* function and specify a single column to sort by:

In [49]:
# sort by population, lowest to highest
df4.sort_values(by="Population")

Country,Region,Population,Life Exp,Landlocked,Language
New Zealand,Oceania,4.66,80.67,No,English
Ireland,Europe,4.71,80.15,No,English
Paraguay,South America,6.78,76.99,Yes,Spanish
Switzerland,Europe,8.12,82.5,Yes,German
Portugal,Europe,10.29,80.68,No,Portuguese
Chad,Africa,11.63,49.81,Yes,Arabic
Niger,Africa,18.05,55.13,Yes,French
Australia,Oceania,23.99,82.09,No,English
Canada,North America,35.99,80.99,No,English
Argentina,South America,43.59,75.77,No,Spanish


By default, values are sorted in ascending order. To sort in descending order instead, we specify the argument *ascending=False*:

In [50]:
# sort by population, highest to lowest
df4.sort_values(by="Population", ascending=False)

Country,Region,Population,Life Exp,Landlocked,Language
China,Asia,1357.0,74.87,No,Chinese
United States,North America,321.07,78.51,No,English
Brazil,South America,200.4,73.12,No,Portuguese
Nigeria,Africa,186.99,51.3,No,English
Mexico,North America,127.58,75.05,No,Spanish
Japan,Asia,126.26,84.36,No,Japanese
Egypt,Africa,90.37,70.48,No,Arabic
Germany,Europe,81.46,80.24,No,German
United Kingdom,Europe,65.1,80.09,No,English
South Korea,Asia,51.71,83.23,No,Korean


We can also specify a list of multiple columns to sort by, which can be used to resolve ties:

In [51]:
# sort by region first, then population
df4.sort_values(by=["Region","Population"])

Country,Region,Population,Life Exp,Landlocked,Language
Chad,Africa,11.63,49.81,Yes,Arabic
Niger,Africa,18.05,55.13,Yes,French
Egypt,Africa,90.37,70.48,No,Arabic
Nigeria,Africa,186.99,51.3,No,English
South Korea,Asia,51.71,83.23,No,Korean
Japan,Asia,126.26,84.36,No,Japanese
China,Asia,1357.0,74.87,No,Chinese
Ireland,Europe,4.71,80.15,No,English
Switzerland,Europe,8.12,82.5,Yes,German
Portugal,Europe,10.29,80.68,No,Portuguese


In [52]:
# sort by region first (ascending order), then population (descending order)
df4.sort_values(by=["Region","Population"], ascending=[True, False])

Country,Region,Population,Life Exp,Landlocked,Language
Nigeria,Africa,186.99,51.3,No,English
Egypt,Africa,90.37,70.48,No,Arabic
Niger,Africa,18.05,55.13,Yes,French
Chad,Africa,11.63,49.81,Yes,Arabic
China,Asia,1357.0,74.87,No,Chinese
Japan,Asia,126.26,84.36,No,Japanese
South Korea,Asia,51.71,83.23,No,Korean
Germany,Europe,81.46,80.24,No,German
United Kingdom,Europe,65.1,80.09,No,English
Spain,Europe,47.13,83.49,No,Spanish


We can also sort a DataFrame based on its index labels, by calling *sort_index()*:

In [53]:
df4.sort_index()

Country,Region,Population,Life Exp,Landlocked,Language
Argentina,South America,43.59,75.77,No,Spanish
Australia,Oceania,23.99,82.09,No,English
Brazil,South America,200.4,73.12,No,Portuguese
Canada,North America,35.99,80.99,No,English
Chad,Africa,11.63,49.81,Yes,Arabic
China,Asia,1357.0,74.87,No,Chinese
Egypt,Africa,90.37,70.48,No,Arabic
Germany,Europe,81.46,80.24,No,German
Ireland,Europe,4.71,80.15,No,English
Japan,Asia,126.26,84.36,No,Japanese
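
Note that *sort_values()* and *sort_index()* return a new, sorted DataFrame by default and leave the original untouched. The short sketch below demonstrates this on an illustrative three-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Population": [47.13, 4.64, 126.26]},
    index=["Spain", "Ireland", "Japan"],
)

# sorting produces new frames
by_population = df.sort_values(by="Population")
by_label = df.sort_index()

print(by_population.index.tolist())  # ['Ireland', 'Spain', 'Japan']
print(by_label.index.tolist())       # ['Ireland', 'Japan', 'Spain']

# the original row order is unchanged
print(df.index.tolist())             # ['Spain', 'Ireland', 'Japan']
```

To keep a sorted version, we assign the result to a variable, for example `df_sorted = df.sort_values(by="Population")`.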


Sorting is often the first step in ranking tasks. For instance, we might want to perform Top-*N* analysis, i.e. sorting the data and then keeping only the *N* highest or lowest values:

In [54]:
# get top-3 countries with highest population
df4.sort_values(by="Population", ascending=False).head(3)

Country,Region,Population,Life Exp,Landlocked,Language
China,Asia,1357.0,74.87,No,Chinese
United States,North America,321.07,78.51,No,English
Brazil,South America,200.4,73.12,No,Portuguese


In [55]:
# get bottom-3 countries with lowest population
df4.sort_values(by="Population", ascending=True).head(3)

Country,Region,Population,Life Exp,Landlocked,Language
New Zealand,Oceania,4.66,80.67,No,English
Ireland,Europe,4.71,80.15,No,English
Paraguay,South America,6.78,76.99,Yes,Spanish
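
Pandas also offers the *nlargest()* and *nsmallest()* methods as a shorthand for this sort-then-head pattern. The sketch below uses a five-row frame of our own with population values taken from the data above:

```python
import pandas as pd

df = pd.DataFrame(
    {"Population": [1357.0, 321.07, 200.4, 4.66, 4.71]},
    index=["China", "United States", "Brazil", "New Zealand", "Ireland"],
)

# the 2 rows with the highest population, largest first
top2 = df.nlargest(2, "Population")
print(top2.index.tolist())     # ['China', 'United States']

# the 2 rows with the lowest population, smallest first
bottom2 = df.nsmallest(2, "Population")
print(bottom2.index.tolist())  # ['New Zealand', 'Ireland']
```

Either approach gives the same rows here; which one reads better is largely a matter of taste.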
