# Practice with Pandas

Author: Greg Wray  
2025-APR-01

In [None]:
import numpy as np
import pandas as pd
import sys
from datetime import date
import pyarrow
import re

In [None]:
# record current Python and library versions for reproducibility
print("session date", date.today())
print("python ", sys.version)
print("numpy ", np.__version__)
print("pandas ", pd.__version__)
print("pyarrow ", pyarrow.__version__)

## Pandas

**Pandas** is a library of data structures and functions for working with **tabular** data in Python. Like NumPy, pandas is widely used in data science, machine learning, and scientific computing. 

Pandas is designed for *heterogeneous* data organized by column (data frame or spreadsheet). This is in contrast to NumPy, which is designed for *homogeneous* data in every dimension (vector, matrix, array, and tensor). 

Pandas provides two primary data structures. A **DataFrame** is designed to hold tabular data, similar to a data frame or tibble in R. DataFrames are the central way data is stored and manipulated in pandas. Each column in a DataFrame is the same length and contains values of the same data type. A **Series** is a 1-dimensional, ordered container that is similar to a NumPy ndarray. You can think of a Series as a single column of a DataFrame. A Series holds values of a single data type, and will automatically "up-cast" data to homogenize data types (e.g., any integers will be converted to floats if there are any float values). 
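The up-casting behavior is easy to check directly. The following is a minimal sketch using made-up values; the variable names are only for illustration.

```python
import pandas as pd

# a Series built only from integers keeps an integer dtype
ints = pd.Series([1, 2, 3])

# a single float value up-casts every item to float
mixed = pd.Series([1, 2, 3.5])

# mixing numbers and strings falls back to the generic 'object' dtype
text = pd.Series([1, 'two', 3.0])

print(ints.dtype, mixed.dtype, text.dtype)
```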

Both DataFrame and Series data structures provide an automatic integer index that references rows and items, respectively. This default index can be replaced with a user-defined index based on integers or strings. In addition, DataFrames provide automatic integer column names, although in practice these are typically replaced with string names.  

Pandas Series and DataFrames support **vectorized operations** for mathematics, functions, and conditions, similar to ndarrays in NumPy. Refer to the examples below.
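As a quick illustration of vectorized operations, the sketch below applies arithmetic, a comparison, and a NumPy function to whole columns at once. The small DataFrame `expr` is hypothetical and uses values from the hand-built examples later in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical expression values for three genes at two time points
expr = pd.DataFrame({'time_1': [4.90, 2.39, 3.03],
                     'time_2': [0.18, 0.75, 0.24]})

# arithmetic between columns is applied element by element, with no explicit loop
ratio = expr['time_1'] / expr['time_2']

# comparisons are also vectorized and return a Boolean Series
is_high = expr['time_1'] > 3.0

# NumPy functions accept a Series and return a Series
log_vals = np.log2(expr['time_1'])
```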

Pandas provides a large set of functions and methods to facilitate working with DataFrame and Series objects. The performance-critical components of pandas are written in C and Cython. Use the prefix `pd.` to access pandas functions; attributes and methods do not need the prefix since they are appended directly to a pandas data object.

## Data frames from scratch

You can create a DataFrame completely by hand using a constructor function, although in practice this is not common. The simplest approach is to pass a nested list to the constructor, where each inner list becomes a row. Passing a dictionary also works, but be aware that each key-value pair is treated as a column (see examples below).

In [None]:
# create a DataFrame by hand from a nested list
#    note that pandas automatically assigns integer names for rows and columns
dataA = [['SPU_000003',4.90,0.18,1.71,0.19,0.43], ['SPU_000007',2.39,0.75,6.17,0.01,0.07],
        ['SPU_000008',3.03,0.24,0.74,0.38,0.64], ['SPU_000011',0.47,0.18,0.48,0.48,0.72],
        ['SPU_000013',4.36,0.23,3.45,0.06,0.22], ['SPU_000016',0.29,0.37,0.44,0.50,0.74]]
dfA = pd.DataFrame(dataA)
dfA

In [None]:
# create a DataFrame by hand from a dictionary
#    note that keys are used as column names by default; rows are assigned integer names
dataB = {'SPU_000003' : [4.90,0.18,1.71,0.19,0.43], 'SPU_000007' : [2.39,0.75,6.17,0.01,0.07],
        'SPU_000008' : [3.03,0.24,0.74,0.38,0.64], 'SPU_000011' : [0.47,0.18,0.48,0.48,0.72],
        'SPU_000013' : [4.36,0.23,3.45,0.06,0.22], 'SPU_000016' : [0.29,0.37,0.44,0.50,0.74]}
dfB = pd.DataFrame(dataB)
dfB

In [None]:
# a better way to create a DataFrame by hand from a dictionary
#    use keys as column names
dataC = {'gene' : ['SPU_000003','SPU_000007','SPU_000008','SPU_000011','SPU_000013','SPU_000016'], 
        'time_1' : [4.90, 2.39, 3.03, 0.47, 4.36, 0.29],
        'time_2' : [0.18, 0.75, 0.24, 0.18, 0.23, 0.37],
        'time_3' : [1.71, 6.17, 0.74, 0.48, 3.45, 0.44],
        'time_4' : [0.19, 0.01, 0.38, 0.48, 0.06, 0.50],
        'time_5' : [0.43, 0.07, 0.64, 0.72, 0.22, 0.74]}
dfC = pd.DataFrame(dataC)
dfC

## Data import and export

In most situations, you will create a DataFrame by importing values from a file. You may also want to export a DataFrame to a standard file format for archival purposes or to share. Pandas provides functions to carry out these tasks. 

**Importing data.** `read_csv()` is the most common tool for data import. Be sure to prefix with `pd.` to generate a pandas DataFrame. 

`read_csv()` can accommodate any separator, in spite of its name.

By default, `read_csv()` creates an index column of consecutive integers for each row. If a header row is present in the input file, it will be used for column names; otherwise, columns will be named with consecutive integers. See later sub-sections for how to change these behaviors. 
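To try these defaults without a file on disk, you can read from an in-memory string. This is a minimal sketch: the text content is made up, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# hypothetical file contents held in memory instead of on disk
text = "country;area\nIceland;103000\nJapan;377975\n"

# sep can be any delimiter, here a semicolon; the first line becomes the column names
demo = pd.read_csv(io.StringIO(text), sep=';')

# with header=None, pandas treats every line as data and numbers the columns
no_header = pd.read_csv(io.StringIO("Iceland;103000\nJapan;377975\n"),
                        sep=';', header=None)

print(demo)
print(no_header)
```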

In [None]:
# import a .csv file 
df = pd.read_csv('countries.csv')

In [None]:
# import a .tsv file by defining the separator
df = pd.read_csv('countries.tsv', sep='\t')

**Import a subset of columns.** You can specify which columns to import using column names if a header row is present. You can specify which columns to import using column position regardless of whether names are available.  

In [None]:
# import a subset of columns by name
df = pd.read_csv('countries.csv', usecols=['country_name', 'species_total', 'species_endemic'])

In [None]:
# import a subset of columns by position
df = pd.read_csv('countries.csv', usecols=[0, 8, 10])

**Exporting data.** You can export DataFrames to a variety of file formats. While `.csv` is highly portable and perhaps the most appropriate format for sharing data, it is worth considering other options for large data sets. Two common formats are Parquet and Feather, which provide compression and preserve data types. To use these file formats, the `pyarrow` library must be installed.

In [None]:
# export to .csv
df.to_csv('my_file.csv')

In [None]:
# export to .tsv by defining the separator
df.to_csv('my_file.tsv', sep='\t')

In [None]:
# export to parquet format
df.to_parquet('my_file.parquet')

**Managing the index column.** The default behavior of `.to_csv()` is to include both the row and column indexes in the output file. If you then import the data back into a pandas DataFrame, the column names will be incorporated as expected, but a *new* integer index column will be added and the old one will become the first data column. To avoid this, you can either save to `.csv` without the index or import to DataFrame without the old index. You also have the option of exporting and importing without column names and of supplying different column names when importing. 

In [None]:
# export to .csv without the index column
df.to_csv('my_file.csv', index=False)

In [None]:
# use the first column of the .csv as the index instead of creating a new one
df = pd.read_csv('my_file.csv', index_col=[0])
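The round-trip problem described above can be reproduced in memory. This sketch uses a small hypothetical DataFrame and writes to a string rather than a file; calling `to_csv()` with no path returns the CSV text.

```python
import io
import pandas as pd

demo = pd.DataFrame({'gene': ['SPU_000003', 'SPU_000007'],
                     'time_1': [4.90, 2.39]})

# default export writes the index as an unnamed first column
csv_text = demo.to_csv()

# a naive re-import turns the old index into a data column named 'Unnamed: 0'
naive = pd.read_csv(io.StringIO(csv_text))

# fix 1: tell read_csv() to use the first column as the index
fixed_on_import = pd.read_csv(io.StringIO(csv_text), index_col=0)

# fix 2: leave the index out when exporting
fixed_on_export = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(naive.columns))
```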

**Managing column names.**  By default, `read_csv()` uses names in a header row (if present) for column names in the resulting DataFrame. Similarly, `to_csv()` uses column names in a DataFrame to create a header row in the output file. You can change these behaviors and work around multi-line headers. 

In [None]:
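# import from .csv dropping column names (they will be replaced by integer names)
#    header=None assigns integer names; skiprows=1 keeps the header row out of the data
df = pd.read_csv('my_file.csv', header=None, skiprows=1)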

In [None]:
# import from .csv and use the second header row as column names
df = pd.read_csv('my_file.csv', header=1)

In [None]:
# import from .csv dropping column names and replace with names that you supply
df = pd.read_csv('my_file.csv', header=0, names=['col1', 'col2', 'col3'])

In [None]:
# export to .csv without column names
df.to_csv('my_file.csv', header=False)

**Converting data types.** By default, `read_csv()` will infer data types for each column during import. You can change the data type of individual columns at any time after importing.

In [None]:
# change the data type of a column
df['area_total'] = df['area_total'].astype(float)
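Type inference is not always what you want. The sketch below, which uses made-up data and hypothetical column names, shows an identifier column losing its leading zeros and how the `dtype` argument of `read_csv()` prevents that at import time.

```python
import io
import pandas as pd

# hypothetical data in which site_code is an identifier, not a number
text = "site_code,count\n007,12\n042,5\n"

# inferred types: site_code is read as an integer and loses its leading zeros
inferred = pd.read_csv(io.StringIO(text))

# dtype= fixes the type of a column at import time
typed = pd.read_csv(io.StringIO(text), dtype={'site_code': str})

# astype() converts after import and returns a new Series
as_float = inferred['count'].astype(float)

print(inferred['site_code'].tolist(), typed['site_code'].tolist())
```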

## Viewing DataFrame contents
Note that the DataFrame includes an index column that is automatically generated when created; it is the left-most column and lacks a column name. By default, pandas assigns consecutive integers starting at 0 to the index when creating a DataFrame.

In [None]:
# view the first 5 (or specified number of) rows
df.head()

In [None]:
# view the last 5 (or specified number of) rows
df.tail()

In [None]:
# set the number of columns to display to 100
pd.set_option('display.max_columns', 100)

In [None]:
# view a random sample of rows
df.sample(5)

In [None]:
# view a random fraction of rows
df.sample(frac=0.05)

In [None]:
# use the optional state argument for reproducibility when randomly sampling
df.sample(5, random_state=42)

## Accessing information about a DataFrame
Pandas provides an extensive set of attributes and methods to retrieve information about a DataFrame. Note that calling an attribute requires only the name, while calling a method requires the name followed by `()`.

In [None]:
# retrieve the dimensions of a DataFrame; attribute
df.shape   # rows, columns

In [None]:
# retrieve column names; attribute; returns an Index object
df.columns

In [None]:
# retrieve a list of column names
list(df.columns)

In [None]:
# retrieve a Series of column names and their associated data types; attribute
#    iterables, including strings, are described as 'object'
df.dtypes

In [None]:
# retrieve the row index; attribute
#   for the default integer index this will be a RangeIndex object
df.index

In [None]:
# get a quick view of a data frame; method
#   gotcha: omitting the parentheses (df.info) returns the bound method, which displays the DataFrame contents instead of the summary
df.info()

In [None]:
# retrieve basic statistics for numeric columns (ignores non-numeric columns); method
df.describe()

In [None]:
# retrieve basic statistics for Boolean and factor-like columns; method
df[['continent_name', 'bioregion_name']].describe()

## Indexing a DataFrame
Pandas provides several approaches to indexing. The two recommended approaches use index *positions* or *values* with the indexers `.iloc[]` and `.loc[]`, respectively. For approaches to index just columns or rows, see subsequent sections.

Referencing a single row or a single column returns a Series; referencing multiple rows or multiple columns returns a DataFrame. To retrieve a single row or column as a DataFrame, wrap the name or integer in a list of length 1 (compare the first two code blocks below). 

**Indexing with .iloc[ ].** This approach references the *position* of rows and columns rather than the value of their indexes. It is an integer-based approach that uses standard square-bracket, zero-based indexing; it accepts slices, negative values, and steps. 

Passing a single argument indexes rows; two arguments index rows and columns respectively; `:,` followed by one argument indexes columns. 

Because indexing with `.iloc[]` uses the position of rows, it will almost always return a different result before and after sorting. 

In [None]:
# retrieve a row as a Series (mixed value types are up-cast to the 'object' dtype)
df.iloc[2]

In [None]:
# to retrieve a row as a DataFrame, pass the position as a list of one
df.iloc[[12]]

In [None]:
# to retrieve multiple rows, extend the list
df.iloc[[12, 29, 7]]

In [None]:
# retrieve value from a single cell; preserves data type
df.iloc[1, 5]

In [None]:
# slices work as expected
df.iloc[:5, :5]

In [None]:
# negative values also work
df.iloc[-5:, -4:]

In [None]:
# retrieve a single column; returns a Series
df.iloc[:, 5]

In [None]:
# retrieve a single column as a DataFrame by placing in a list of one
df.iloc[:, [5]]

In [None]:
# to retrieve multiple columns, extend the list
df.iloc[:, [0, 14, 5]]

**Indexing with .loc[ ].** This approach references the *values* of indexes rather than the position of rows and columns. By default, Pandas creates an integer index column, which `.loc[]` treats as row names. If no column names are provided when a DataFrame is created, pandas will create a header composed of integer values, which `.loc[]` treats as column names. The best way to think about `.loc[]` indexing is that everything is a name, even if it is an integer.

In practice, columns usually have string names. This means that, with the default integer index, `.loc[]` indexing combines numerals for rows with strings for columns. Keep in mind that the numerals are names and not positions.

With `.loc[]` indexing, single values and slices (including open slices and strides) are allowed. Slices can be used with both numeral and string names. Negative numbers do not count back from the end, because `.loc[]` treats them as names like any other value. 

It's important to keep in mind that `.loc[]` indexing is not the same as normal square-bracket indexing. There are some similarities: the default integer index starts at zero and slices are allowed. However, there are three key differences: the end value of a slice is *included*, negative values do not count from the end, and strings can be used as values.  
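The difference in how slice end values are handled is worth seeing side by side. A minimal sketch with a hypothetical five-row DataFrame and the default integer index:

```python
import pandas as pd

demo = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

# .iloc[] slices by position and excludes the end, like a Python list
by_position = demo.iloc[0:3]   # rows at positions 0, 1, 2

# .loc[] slices by label and includes the end
by_label = demo.loc[0:3]       # rows labeled 0, 1, 2, 3

print(len(by_position), len(by_label))
```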

You can pass a list of row and/or column names even if they are integers, because `.loc[]` treats them as values rather than positions. Importantly, this behavior means that indexing with `.loc[]` will return the same result before and after sorting, unlike `.iloc[]`. 
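The effect of sorting on the two indexers can be checked with a small example. This sketch uses a hypothetical three-row DataFrame; after sorting, the label `0` still identifies the same row, while position `0` points to whichever row is now first.

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Peru', 'Chad', 'Fiji'],
                     'species': [1800, 600, 150]})

# sorting moves rows but each row keeps its original index label
sorted_demo = demo.sort_values('species')

# .loc[] follows the label, so the result is the same before and after sorting
label_before = demo.loc[0, 'country']
label_after = sorted_demo.loc[0, 'country']

# .iloc[] follows the position, so the result changes
position_before = demo.iloc[0]['country']
position_after = sorted_demo.iloc[0]['country']

print(label_before, label_after, position_before, position_after)
```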

With `.loc[]` indexing, passing a single argument indexes rows; two arguments index rows and columns respectively; and `:,` followed by one argument indexes columns.

In [None]:
# retrieve a single row as a Series
df.loc[5]

In [None]:
# retrieve a single row as a DataFrame
df.loc[[20]]

In [None]:
# retrieve multiple rows (including discontinuous) by extending the list
df.loc[[4, 20, 3]]

In [None]:
# retrieve a range of rows; note that the end value of the slice is included
df.loc[0:3]

In [None]:
# using negative values with .loc[] does not work as expected (or throws an error)
df.loc[-7:-4]

In [None]:
# retrieve a single column using a string name; returns a Series
df.loc[:, 'country_name']

In [None]:
# retrieve a range of columns using a slice
df.loc[:, 'country_name' : 'bioregion_name']

In [None]:
# retrieve discontinuous columns using a list
df.loc[:, ['country_name', 'continent_name', 'species_endemic']]

In [None]:
# retrieve a subset of rows and columns
df.loc[7:10,['country_name', 'continent_name', 'species_endemic']]

In [None]:
# retrieve a subset of discontinuous rows and columns using two lists
df.loc[[7, 6, 11, 2], ['country_name', 'continent_name', 'species_endemic']]

**Setting the index column.** The default integer index column can be replaced by any other column, which will then be used for indexing rows. You can also specify multiple columns to act as the index, although this is not commonly needed.

Although you can set any column to be the index, it is most useful when there is a column that can be used for `.loc[]` indexing (see examples below). This can make code more readable, especially for filtering and joins. 

Pandas allows duplicate values in the index column, but in general an index is most useful when each value is unique. To learn how to check for unique values, see "Summarizing data", below. 
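One way to check whether a candidate index column is unique is the `.is_unique` attribute of the resulting index. The sketch below uses a hypothetical DataFrame in which one country appears twice, and also shows a two-column index.

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Iceland', 'Japan', 'Japan'],
                     'year': [2020, 2020, 2021],
                     'n_species': [75, 600, 610]})

# a single-column index with a repeated value is allowed but not unique
by_country = demo.set_index('country')
print(by_country.index.is_unique)

# .loc[] with a duplicated label returns every matching row
japan = by_country.loc['Japan']

# combining two columns into the index makes each row uniquely addressable here
by_both = demo.set_index(['country', 'year'])
print(by_both.index.is_unique)
```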

In [None]:
# set the index to a different column
df2 = df.set_index('country_name')
df2.head()

In [None]:
# now you can use .loc indexing with more intuitive names for rows and columns
df2.loc[['Iceland', 'Japan'], ['continent_name', 'bioregion_name']]

In [None]:
# retrieve entire rows by name
df2.loc[['Iceland', 'Japan'], :]

## Filtering columns of a DataFrame

**Referencing columns by name.** Columns can be indicated by enclosing a single name or list of names in square brackets. 

Passing a single name returns a Series; passing a list of one or more names returns a DataFrame. When passing a list, columns can be discontinuous and/or out of order. 

In [None]:
# view a single column by name; returns a Series
df['country_name']

In [None]:
# view a single column by name; returns a DataFrame
df[['country_name']]

In [None]:
# view a discontinuous subset of columns
df[['country_name','species_total', 'species_resident', 'species_endemic']]

Comprehensions can be used to generate lists based on filtering column names. This is particularly useful for large DataFrames where columns are named according to a consistent convention.

In [None]:
# filter column names by substring using a list comprehension
df[[c for c in df.columns if 'species' in c]]

Pandas allows you to reference a single column using just its name without square brackets. This approach is *not* recommended: it fails for column names that contain spaces or that match an existing DataFrame attribute or method, and it cannot be used to create a new column. However, you may encounter this approach in older code, so it's useful to know how to recognize it. 

In [None]:
# view a single column using a "bare" name; not recommended
df.country_name
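A short sketch of why "bare" names are fragile, using a hypothetical DataFrame whose column names were chosen to cause trouble: one matches a DataFrame method and the other contains a space.

```python
import pandas as pd

demo = pd.DataFrame({'count': [3, 5], 'species total': [10, 20]})

# demo.count is the DataFrame.count() method, not the 'count' column
bare = demo.count
print(callable(bare))

# square brackets always return the column
column = demo['count']

# a name containing a space cannot be written as an attribute at all,
# but works as expected with square brackets
total = demo['species total'].sum()
```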

**Referencing columns by position.** It is possible to reference columns by position using `.columns[]` with standard numerical square bracket indexing; it accepts slices, negative values, and steps. 

In general, this method is *not recommended*. Using numbers is more brittle than using names and the code is more difficult to read. If you do want to use position to refer to columns, `.iloc[]` indexing (above) is a better approach.

In [None]:
# retrieve the names of columns by position
df.columns[3:5]

In [None]:
# retrieve the values of columns by position
df[df.columns[3:5]]

In [None]:
# view a numeric slice of columns
df[df.columns[-5:]]    # returns last 5 columns

**Referencing columns by data type.** In some cases, it can be useful to retrieve values from all columns of a particular data type. 

In [None]:
# view all columns of a particular data type
df.select_dtypes(int) 

## Filtering rows of a DataFrame

Pandas provides several ways to filter rows based on values in one or multiple columns. Any expression that evaluates to `True` or `False` for each row can be used as a condition, including compound conditions. 

When referring to a column in a condition, precede with the name of the DataFrame and use square brackets with the column name (see examples below). 

**Filter based on numeric value.** Equality and inequality conditions work as expected. To filter for maximum and minimum values in a column, see "Summarizing Data" below.

In [None]:
# filter using a comparison operator
df[df['species_endemic'] >= 100]

In [None]:
# filter based on result of arithmetic expression involving two columns
df[df['species_resident'] / df['species_total'] < 0.1]

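**Filter based on string value.** Use `==` for exact matches and the `.str` accessor for substring matches. The `.str` methods also accept regex patterns written as ordinary strings, so importing the standard Python `re` library is not required for the examples below.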

In [None]:
# filter for rows containing a specific value in one column 
df[df['continent_name'] == 'Africa']

In [None]:
# filter based on a substring
df[df['country_name'].str.contains('New')]

In [None]:
# filter for names ending in 'land' using regex
df[df['country_name'].str.contains('land$', regex=True)]

In [None]:
# filter for names consisting of exactly two words using regex
df[df['country_name'].str.contains(r'^\S+ \S+$', regex=True)]

**Filter based on membership.** First define a comparison list or set, then use `.isin()` to test for membership.  

In [None]:
# filter based on membership; first define a list to test against
western_hemisphere = ['South America', 'North America']
df[df['continent_name'].isin(western_hemisphere)]

**Filter based on compound condition.** Note that compound conditions use the `&` and `|` operators rather than the `and` and `or` keywords, and that each individual condition must be wrapped in round brackets. Negate any Boolean expression by wrapping it in round brackets and preceding it with `~`.

In [None]:
# filter for conditions in two columns 
df[(df['continent_name'] == 'Africa') & (df['species_endemic'] > 0)]

In [None]:
# filter for rows where the country is not in Europe
df[~(df['continent_name'] == 'Europe')]
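The round brackets are needed because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. One way to keep compound filters readable is to name each condition first. This is a minimal sketch with a hypothetical three-row DataFrame and made-up column names.

```python
import pandas as pd

demo = pd.DataFrame({'continent': ['Africa', 'Africa', 'Europe'],
                     'endemic': [0, 12, 3]})

# each condition is a Boolean Series with one value per row
in_africa = demo['continent'] == 'Africa'
has_endemics = demo['endemic'] > 0

# combine the named conditions with &, |, and ~
both = demo[in_africa & has_endemics]
either = demo[in_africa | has_endemics]
neither = demo[~(in_africa | has_endemics)]

print(len(both), len(either), len(neither))
```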

**Filtering compound conditions with .query().** Using `.query()` provides a simpler and more readable syntax when writing compound queries. Note that the syntax uses `and` and `or` keywords. To negate all or part of an expression, use the `~` operator.

A useful feature of `.query()` is that you can use variables in conditions. Any variable in scope can be referenced by preceding its identifier with `@`.

In [None]:
# filter for conditions in two columns; equivalent to the compound filter with & above
df.query("(continent_name == 'Africa') and (species_endemic > 0)")

In [None]:
# filter by condition using an external variable
threshold = 15
df.query('(eba_count > @threshold)')

**Removing duplicates.** By default, rows are treated as duplicates if and only if they contain the same value in every column. Use the `subset` argument to compare only particular columns.

Removing duplicates does not alter the original DataFrame; assign to a new identifier to keep the results. The default behavior when removing duplicates is to keep the first occurrence and drop all others, but this can be changed. 

In [None]:
# create a boolean index of duplicated rows
duplicates = df.duplicated()
duplicates

In [None]:
# count the number of duplicated rows
duplicates.sum()

In [None]:
# keep the first occurrence of duplicated rows and drop all others
df2 = df.drop_duplicates()

In [None]:
# drop all duplicated rows (i.e., retain no rows that are duplicated)
df_deduplicated = df.drop_duplicates(keep=False)

In [None]:
# drop rows that repeat a value in a single column, keeping the first occurrence
df_deduplicated = df.drop_duplicates(subset=['country_name'], keep='first')
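The effect of the `keep` and `subset` arguments is easiest to see on a tiny example. In this sketch, with a hypothetical four-row DataFrame, rows 0 and 1 are identical, and 'Peru' appears three times in the `country` column.

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Peru', 'Peru', 'Chad', 'Peru'],
                     'year': [2020, 2020, 2020, 2021]})

# default: keep the first copy of each fully duplicated row
keep_first = demo.drop_duplicates()

# keep the last copy instead
keep_last = demo.drop_duplicates(keep='last')

# keep no copy of any row that is duplicated
keep_none = demo.drop_duplicates(keep=False)

# compare only the 'country' column when deciding what is a duplicate
by_country = demo.drop_duplicates(subset=['country'])

print(list(keep_first.index), list(keep_last.index),
      list(keep_none.index), list(by_country.index))
```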

## Updating values

Becoming proficient with indexing gives you the ability to update values in precise ways. The examples below illustrate basic approaches to updating using a condition or by replacement. 

Most methods for updating values in a DataFrame operate *in place*, so be careful! 

**Replace values in a column based on a condition.** In the following example, the condition and the updated value refer to the same column, but this is not required. Note the need to specify the column to update explicitly for this reason.

In [None]:
# change all continent_name values matching 'Europe' to 'Europa'
df.loc[df['continent_name'] == 'Europe', 'continent_name'] = 'Europa'
df.head(10)

**Replace values in a column from an ordered iterable.**  It is possible to update values in an entire column directly from a base Python list or tuple. The length of the data structure must equal the number of rows in the DataFrame. Make sure values are in the correct order (aligned) with rows in the DataFrame. 

In [None]:
# replace values in the eba_count column with new values (in this case, meaningless)
#    the iterable must supply exactly one value per row
new_values = range(len(df))
df['eba_count'] = new_values
df.iloc[0:8, -5:]

## Summarizing data
Pandas offers methods for retrieving the usual set of summary statistics. A single statistic applied to a single column returns a scalar (an integer or float); the same statistic applied to multiple columns returns a Series; and multiple statistics applied to multiple columns return a DataFrame.

Pass a list of column names to retrieve values for more than one column. Use the `.agg()` method to apply multiple summary statistics.

In [None]:
# retrieve the mean value of a column
df['species_total'].mean()

In [None]:
# retrieve the count of rows containing non-null values for a given column
df['biodiv_index'].count()

In [None]:
# any of these queries can be run on multiple columns at once; returns a Series
df[['species_total', 'species_resident', 'species_endemic']].max()

In [None]:
# to run multiple methods on a set of columns, use the .agg() method
df[['species_total', 'species_resident', 'species_endemic']].agg(['mean', 'min', 'max'])
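The three return types described above can be confirmed with a small sketch; the DataFrame and its column names here are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# one statistic on one column: a scalar
one = demo['a'].mean()

# one statistic on several columns: a Series indexed by column name
several = demo[['a', 'b']].mean()

# several statistics on several columns: a DataFrame with one row per statistic
table = demo[['a', 'b']].agg(['mean', 'max'])

print(type(one).__name__, type(several).__name__, type(table).__name__)
```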

**Unique values.** Pandas provides methods for counting, listing, and tallying unique values by column. 

In [None]:
# retrieve unique values from a column; returns an ndarray
df['continent_name'].unique()

In [None]:
# retrieve the count of unique values from a column
df['continent_name'].nunique()

In [None]:
# retrieve the count of each unique value from a column; returns a Series
df['continent_name'].value_counts()

In [None]:
# retrieve the proportion of each unique value from a column; returns a Series
df['continent_name'].value_counts(normalize=True)

In [None]:
# retrieve the count of each unique value combination from multiple columns; returns a multi-index Series
df[['continent_name','bioregion_name']].value_counts()

In [None]:
# retrieve the count of each unique value combination from multiple columns as a DataFrame
df[['continent_name','bioregion_name']].value_counts().reset_index()

**Grouping.** The `.groupby()` method works similarly to grouping in R. This allows you to apply any of the summary statistics to values in one or more columns after grouping by some other variable. Use the `.agg()` method to apply multiple summary statistics; pass a list of column names to retrieve values for more than one column.

In [None]:
# retrieve the mean number of species per country from each continent
df.groupby('continent_name')[['species_total']].mean()

In [None]:
# generate multiple summary statistics by categorical variable
df.groupby('continent_name')[['species_total']].agg(['mean', 'min', 'max'])

In [None]:
# generate multiple summary statistics for multiple columns by categorical variable
df.groupby('continent_name')[['species_total', 'species_endemic']].agg(['mean', 'min', 'max'])
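When several statistics are computed at once, the result has two-level column names. An alternative that some find easier to work with is named aggregation, where each output column gets an explicit name. The sketch below uses a hypothetical DataFrame; the output names `mean_species` and `max_species` are arbitrary choices.

```python
import pandas as pd

demo = pd.DataFrame({'continent': ['Africa', 'Africa', 'Europe'],
                     'species': [600, 1000, 300]})

# a single statistic on a single column returns a Series indexed by group
means = demo.groupby('continent')['species'].mean()

# named aggregation: output_name=(input_column, statistic)
summary = demo.groupby('continent').agg(
    mean_species=('species', 'mean'),
    max_species=('species', 'max'),
)

print(summary)
```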

## Modifying the structure of a DataFrame

DataFrames allow you to freely add or remove rows and columns. This is an important difference from NumPy ndarrays, which have a fixed size.

Most operations that change the number of rows or columns do *not* alter the DataFrame in-place. Use assignment to create a new DataFrame with the requested change.

For **removing duplicate rows**, see "Filtering rows of a DataFrame", above. For **removing rows with missing values**, see "Working with missing data", below. 

**Add a column containing values from an ordered iterable.** Pandas allows you to append columns directly from base Python lists or tuples. The length of the data structure must equal the number of rows in the DataFrame. Make sure values are in the correct order (aligned) with rows in the DataFrame. In addition, make sure that the name of the new column does not duplicate an existing name; if there is a match in names, pandas will replace values in that column with values from the iterable.

In [None]:
# add a new column from a list
capitals =['Buenos Aires', 'Canberra', 'Vienna', 'Bridgetown', 'City of Brussels', 'Belmopan', 'Thimphu', 'Gaborone', 'Brasília', 'Phnom Penh', 'Ottawa', 'Santiago', 'Beijing', 'Bogotá', 'San José', 'Zagreb', 'Santo Domingo', 'Quito', 'Cairo', 'London', 'Helsinki', 'Paris', 'Berlin', 'Athens', 'Hong Kong', 'Budapest', 'Reykjavík', 'New Delhi', 'Jakarta', 'Tehran', 'Dublin', 'Rome', 'Kingston', 'Tokyo', 'Nairobi', 'Riga', 'Kuala Lumpur', 'Valletta', 'Mexico City', 'Rabat', 'Naypyidaw', 'Windhoek', 'Kathmandu', 'Amsterdam', 'Wellington', 'Oslo', 'Panama City', 'Port Moresby', 'Lima', 'Lisbon', 'San Juan', 'Doha', 'Bucharest', 'Edinburgh', 'Singapore', 'Ljubljana', 'Pretoria', 'Seoul', 'Madrid', 'Colombo', 'Stockholm', 'Bern', 'Taipei', 'Dodoma City', 'Bangkok', 'Ankara', 'Abu Dhabi', 'Washington', 'Hanoi', 'Charlotte Amalie']
df['capital'] = capitals
df.iloc[0:5, -5:]

**Add a column containing values computed from existing columns.** This operation is analogous to `mutate()` in dplyr. 

In [None]:
# create a new column and populate with values computed with an expression
df['fraction_endemic'] = df['species_endemic'] / df['species_total']
df.iloc[0:5, -5:]

**Add a single row.** It is possible to use a Series or a base Python list or tuple to hold the new values. In each case, make sure the number of items matches the number of columns in the DataFrame and that items are in the correct order. A mismatch in overall length will generate an error; a mismatch in data type at any position can change the data type of that column, so check `df.dtypes` afterward.

In [None]:
# add a row to an existing DataFrame
new_row = ['Rohan', 'Middle Earth', 780, 0, 0, 'Gondorian', 2, 3, 352, 289, 0, 0, 0.32, 1399, 4512, 0.1, 3.2, 24.3, 0, 0, 0, 0, 0, 'Kingdom of Rohan', 'Kingdom of Rohan', 'RHN']
df.loc[len(df)] = new_row
df.iloc[-5:, :]

**Delete rows or columns.** To delete duplicate rows, see "Removing duplicates" above; to delete rows with missing values, see "Working with missing data" below. 

In [None]:
# delete a column based on its name
#   to delete multiple columns, pass a list of names
df = df.drop('country_iso', axis=1)

In [None]:
# delete a row based on index
#   to delete multiple rows, pass a list of values (can be discontinuous)
df = df.drop(1)

In [None]:
# delete rows based on condition
#   filter without assignment first to test the condition!
df = df[df['species_endemic'] >= 100]

**Swap rows and columns.** Use the `.T` attribute or the `.transpose()` method, similar to NumPy.

In [None]:
# transpose the DataFrame so that rows become columns and columns become rows
df5 = df.T
df5

## Sorting a DataFrame

Sorting refers to the *sequence* of rows. Sorting on one or more columns moves entire rows up or down. If multiple columns are specified, pandas sorts on the first, then uses the second to break any ties, and so forth. 

In most cases, the order of rows will change if you sort. This means that the integer index values will no longer be in numerical order. You can use `.loc[ ]` to locate rows by their original index value or `.iloc[ ]` to locate them by their new position.

Note that sorting does not alter the original DataFrame; to capture the sorted DataFrame, assign to a new DataFrame. 

**Sorting based on values in a column or columns.**

In [None]:
# sort based on values in a single column
df8 = df.sort_values('species_total')
df8

In [None]:
# sort in descending order
df.sort_values('species_total', ascending=False)

In [None]:
# sort based on values in multiple columns
df.sort_values(['bioregion_count','species_total'])

In [None]:
# specify different sort orders when working with multiple columns
df.sort_values(['bioregion_count','species_total'], ascending=[False, True])

**Sorting based on the index.** When using the default integer index, it is sometimes useful to reset values so that they are more intuitive (e.g., in ascending order following a sort). By default, resetting the index preserves the original index as a new data column and creates a new index.

In [None]:
# sort based on the index
df.sort_index()

In [None]:
# sort and then renumber the index
df.sort_values('species_total').reset_index()

In [None]:
# sort and then renumber the index, discarding the previous index values
df.sort_values('species_total').reset_index(drop=True)
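The two forms of `.reset_index()` differ only in what happens to the old index. A minimal sketch with a hypothetical three-row DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Peru', 'Chad', 'Fiji'],
                     'species': [1800, 600, 150]})

sorted_demo = demo.sort_values('species')

# default: the old index is preserved as a new data column named 'index'
kept = sorted_demo.reset_index()

# drop=True: the old index is discarded
dropped = sorted_demo.reset_index(drop=True)

print(list(kept.columns), list(dropped.columns))
```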

## Working with missing data

Pandas ignores missing values by default when computing summary statistics (unlike R, where functions such as `mean()` return `NA` unless you pass `na.rm = TRUE`). This can be convenient, but sometimes you want to eliminate rows with missing values in any column or in a particular column.   

In [None]:
# use the .isna method to return a bool for every cell queried
df[['species_total', 'biodiv_index']].isna()

In [None]:
# apply .value_counts() to tally missing (True) and non-missing (False) values in a single column
df[['biodiv_index']].isna().value_counts()

In [None]:
# to drop rows with missing values in a specified column or columns
#    returns a new DataFrame; assign the result to keep it
df.dropna(subset=['biodiv_index'])

In [None]:
# to replace missing values in a specified column or columns (returns a new DataFrame)
df[['biodiv_index']].fillna(0)
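The pieces above can be combined into a short end-to-end sketch. The DataFrame below is hypothetical, with one missing value in the `biodiv` column; none of these calls modifies the original DataFrame.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'species': [100, 250, 75],
                     'biodiv': [0.4, np.nan, 0.9]})

# summary statistics skip missing values unless told otherwise
mean_default = demo['biodiv'].mean()
mean_strict = demo['biodiv'].mean(skipna=False)

# count missing values in every column at once
missing_per_column = demo.isna().sum()

# keep only rows with a value in 'biodiv'
complete = demo.dropna(subset=['biodiv'])

# replace missing values in 'biodiv' only, using a column-to-value dictionary
filled = demo.fillna({'biodiv': 0})

print(missing_per_column)
```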