# Pandas

This notebook introduces Pandas, a library for loading and manipulating structured data. Like NumPy, it provides efficient representations of tabular data and a variety of mathematical operations you can perform on them. It is built on top of NumPy, and in general NumPy operations work with Pandas objects. Where Pandas is most useful is in viewing and manipulating datasets.

Pandas consists of two main data types: Series and DataFrame.

In [54]:
import pandas as pd # Common abbreviation when importing Pandas
import numpy as np

## Series objects

Series objects are similar to one-dimensional NumPy arrays: they represent a sequence of values. Their behavior falls somewhere between a Python dictionary and a list, as the examples below show:



In [55]:
# Series objects are similar to NumPy arrays, with one major difference:
my_arr = np.array([10, 9, 8, 7])
my_arr

array([10,  9,  8,  7])

In [56]:
my_series = pd.Series([10, 9, 8, 7])
my_series

0    10
1     9
2     8
3     7
dtype: int64

In [57]:
# In both cases, we see an array-like structure that stores a sequence of values.
# Both can be accessed by index:
print(my_arr[0], my_series[0])

10 10


In [58]:
# Both support slicing:
print("Array:", my_arr[1:3])
print()
print("Series:")
print(my_series[1:3])

Array: [9 8]

Series:
1    9
2    8
dtype: int64


In [59]:
# Series objects can have math performed on them like NumPy arrays:
my_series + 100

0    110
1    109
2    108
3    107
dtype: int64

In [60]:
my_series + my_series

0    20
1    18
2    16
3    14
dtype: int64

In [61]:
# Series and array objects can even be added together. The result is a Series:
my_arr + my_series

0    20
1    18
2    16
3    14
dtype: int64

In [62]:
# The difference is that Series objects have an index, while arrays do not.
# When we look up elements in an array, we are looking them up by their position
# in the array. Here, we are retrieving the element at position 2 (the 3rd from the left):
my_arr[2]

8

In [63]:
# Here, we are retrieving the element at index 2:
my_series[2]

8

In [64]:
# The difference may not seem significant at first. But we can create a Series
# with any index, even one which is not in order, with noncontiguous values, or
# that does not start with 0:
my_series_2 = pd.Series([10, 9, 8, 7], index=[5, 6, 7, 8])
my_series_2

5    10
6     9
7     8
8     7
dtype: int64

In [65]:
# With this new index, we can no longer access elements at index 0, because there
# is no index 0 in the series:
my_series_2[0]

KeyError: 0

In [None]:
# If we want the first element in the series, we have to retrieve it by its index:
my_series_2[5]

10

In [None]:
# Another example:
my_series_3 = pd.Series([10, 9, 8, 7], index=[2, 9, 100, 5])
my_series_3[100]

8

In [None]:
# Index labels do not have to be integers. Here, they are strings:
my_series_4 = pd.Series([10, 9, 8, 7], index=["a", "zz", "b", "w"])
my_series_4['zz']

9

In [None]:
# In this way, Series objects are similar to a dictionary.
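
The dictionary analogy can be made concrete. The sketch below is an illustrative addition (the name `scores` is hypothetical, not part of the original notebook): it builds a Series directly from a dictionary, in which case the keys become the index, and shows that the `in` operator checks index labels rather than values.

```python
import pandas as pd

# Hypothetical example data: the dictionary keys become the Series index.
scores = pd.Series({"a": 10, "zz": 9, "b": 8, "w": 7})

print(scores["zz"])        # look up a value by its index label
print("zz" in scores)      # membership is tested against the index, like dict keys
print(9 in scores)         # 9 is a value, not an index label
print(list(scores.index))  # the index labels, in insertion order
```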

In [None]:
# However, slicing on Series objects works more like NumPy arrays.
# Here, we see a series with an unusual index.
my_series = pd.Series([10, 9, 8, 7], index=[9, 8, 7, 6])
print("my_series:")
print(my_series)
print()

# When we slice with [], we begin with the 0th element (by position) and continue
# up to but not including the 2nd element (by position).
my_series[0:2]

my_series:
9    10
8     9
7     8
6     7
dtype: int64



9    10
8     9
dtype: int64

In [None]:
# This is a somewhat ambiguous situation: the same operator (the [] operator) behaves
# differently depending on whether we are asking for specific values or ranges
# of values. What if we want to look up a specific value by position, or try to
# slice the Series by index?
#
# Pandas provides two attributes on Series objects that allow us to do this:
#
#   loc: Look up elements and ranges by index
#   iloc: Look up elements and ranges by position

In [None]:
# Here is loc on a longer series with an unordered index:
my_series = pd.Series([1, 2, 3, 9, 100, 25, 31, 5, 6], index=[6, 99, 3, 2, 8, 5, 0, 7, 4])

print('my_series')
print(my_series)
print()
print()

print('Output:')
my_series.loc[99]

my_series
6       1
99      2
3       3
2       9
8     100
5      25
0      31
7       5
4       6
dtype: int64


Output:


2

In [None]:
# Here is loc with slicing.
# Note that, unlike with positional slicing, the final index (in this case 4)
# is included in the output.
#
# With label-based slicing, Pandas cannot assume the index is in order. In this
# series, the indices run 0, then 7, then 4 - they are not in any particular
# order, so Pandas starts at index 0 and keeps going until it has found the
# final index, 4.
my_series.loc[0:4]

0    31
7     5
4     6
dtype: int64

In [None]:
# Here is another example: start with index 0, and continue until we find index 3.
# However, in my_series, index 3 comes before index 0, so nothing is returned.
my_series.loc[0:3]

Series([], dtype: int64)
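
If the intent is "every index between 0 and 3", one option is to sort the index first. The following is a minimal sketch rather than part of the original notebook; the names `sorted_series` and `subset` are introduced here for illustration, and `sort_index()` is the standard Pandas method for ordering a Series by its index.

```python
import pandas as pd

my_series = pd.Series([1, 2, 3, 9, 100, 25, 31, 5, 6], index=[6, 99, 3, 2, 8, 5, 0, 7, 4])

# sort_index() returns a new Series ordered by index; my_series itself is unchanged.
sorted_series = my_series.sort_index()

# On a sorted index, loc[0:3] selects every index from 0 through 3, inclusive.
subset = sorted_series.loc[0:3]
print(subset)
```

On the sorted Series the slice contains the entries at indices 0, 2, and 3, because those are the only indices in that range.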

In [None]:
# Here is iloc. In this case, we are asking for the element at the 0th position
# inside the series. The index is ignored, and we get the 1 that is contained there.
print('my_series')
print(my_series)
print()
print()
print('output:')
my_series.iloc[0]

my_series
6       1
99      2
3       3
2       9
8     100
5      25
0      31
7       5
4       6
dtype: int64


output:


1

In [None]:
my_series

6       1
99      2
3       3
2       9
8     100
5      25
0      31
7       5
4       6
dtype: int64

In [None]:
# Slicing in iloc works like slicing on lists and arrays: it starts at the element
# at the 3rd position, and continues to but does not include the element at the
# 7th position:
my_series.iloc[3:7]

2      9
8    100
5     25
0     31
dtype: int64

In [None]:
# Slicing in loc works by index instead: it starts at the element with
# index 3, and continues through the element with index 7, which is included
# in the output:
my_series.loc[3:7]

3      3
2      9
8    100
5     25
0     31
7      5
dtype: int64

## DataFrame objects

DataFrame objects are like 2-dimensional NumPy arrays. They contain rows and columns, but unlike arrays, both the rows and columns can be labeled (or, more specifically, they have an index). Like Series objects, they are somewhere between dictionaries and lists.

In [None]:
# There are many ways to construct DataFrame objects. We will look at two methods
# in this notebook.
#
# DataFrames are essentially spreadsheets or tables. Here, we construct a 2-column
# DataFrame from two Series objects.
#
# Below, we see the resulting DataFrame contains two columns: one called "a" with
# the contents of series s1, and one called "b" with the contents of series s2.
s1 = pd.Series([100, 200, 300])
s2 = pd.Series([45, 2, 3])

pd.DataFrame({"a": s1, "b": s2})

,a,b
0,100,45
1,200,2
2,300,3


In [None]:
# Note that, when creating DataFrames this way, the indices of the input Series
# objects are very important. Let's try it again:
s1 = pd.Series([100, 200, 300], index=[10, 4, 33])
s2 = pd.Series([45, 2, 3], index=[10, 4, 33])

# We see that the row indices of the DataFrame match the row indices of the input
# Series objects.
pd.DataFrame({"a": s1, "b": s2})

,a,b
10,100,45
4,200,2
33,300,3


In [None]:
# What if the input Series objects only match on one index?
s1 = pd.Series([100, 200, 300], index=[10, 4, 33])
s2 = pd.Series([45, 2, 3], index=[10, 3, 21])

# Row 10 contains both values, since both input Series objects have values at
# index 10. But the remaining indices do not overlap, so the DataFrame contains
# missing values. We will discuss missing values in a later lecture.
pd.DataFrame({"a": s1, "b": s2})

,a,b
3,,2.0
4,200.0,
10,100.0,45.0
21,,3.0
33,300.0,
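
The same index matching applies when Series objects are combined arithmetically. As a small illustrative sketch (the variable `total` is introduced here and is not part of the original notebook), adding the two Series above pairs values by index rather than by position, so only index 10 produces a number:

```python
import pandas as pd

s1 = pd.Series([100, 200, 300], index=[10, 4, 33])
s2 = pd.Series([45, 2, 3], index=[10, 3, 21])

# Addition pairs values by index, not by position.
total = s1 + s2
print(total)

# Only index 10 appears in both inputs, so it is the only non-missing result.
print(total.loc[10])
print(total.isna().sum())  # number of missing entries in the result
```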


In [None]:
# DataFrame objects do not have to be created with Series objects. Here, we create
# a DataFrame from a NumPy array. Note that both the row indices and column names
# are numbers incrementing from 0, because there are no other names Pandas could
# have given them:
pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]))

,0,1
0,1,12
1,2,13
2,3,14
3,4,15


In [None]:
# In situations like this, we can specify our own row indices and column names:
pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

,a,b
9,1,12
84,2,13
3,3,14
2,4,15


In [None]:
# We can use the subscript operator with DataFrames, too. It allows us to retrieve
# columns.
df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

# Specifically, the example below returns the contents of column 'a' as a Series:
df['a']

9     1
84    2
3     3
2     4
Name: a, dtype: int64
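
The subscript operator also accepts a list of column names. The sketch below is an illustrative addition (the names `single` and `several` are hypothetical): a single name returns a Series, while a list of names returns a DataFrame with the columns in the order requested.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

single = df['b']          # one column name -> Series
several = df[['b', 'a']]  # list of column names -> DataFrame, in the order given

print(type(single).__name__)
print(type(several).__name__)
print(several)
```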

In [None]:
# If we want to access both rows and columns, or if we want more fine-grained
# control over how we access data in the DataFrame, we can use loc and iloc here
# too. Both loc and iloc work like they do with Series objects. Specifically,
# they allow us to access rows.
#
# The semantics of loc and iloc are the same as before. The loc example below means
# "retrieve the row at index 9", and the iloc example means "retrieve the 0th row by position."
print("loc:")
print(df.loc[9])
print()
print("iloc:")
print(df.iloc[0])

loc:
a     1
b    12
Name: 9, dtype: int64

iloc:
a     1
b    12
Name: 9, dtype: int64


In [None]:
# We can request rows and columns.
# loc example: "Retrieve the row at index 9 and the column at index 'a'"
# iloc example: "Retrieve the row at position 0 and the column at position 0"
print("loc:")
print(df.loc[9, 'a'])
print()
print("iloc:")
print(df.iloc[0, 0])

loc:
1

iloc:
1


In [None]:
# We can request columns only.
# loc example: "Retrieve all rows and the column at index 'a'"
# iloc example: "Retrieve all rows and the column at position 0"
print("loc:")
print(df.loc[:, 'a'])
print()
print("iloc:")
print(df.iloc[:, 0])

loc:
9     1
84    2
3     3
2     4
Name: a, dtype: int64

iloc:
9     1
84    2
3     3
2     4
Name: a, dtype: int64


In [None]:
# Slicing works too:
# loc example: "Retrieve rows from index 84 to 3 and the column at index 'a'"
# iloc example: "Retrieve all rows from position 1 to position 3 exclusive and
#                the column at position 0"
print("loc:")
print(df.loc[84:3, 'a'])
print()
print("iloc:")
print(df.iloc[1:3, 0])

loc:
84    2
3     3
Name: a, dtype: int64

iloc:
84    2
3     3
Name: a, dtype: int64

In [None]:
df

,a,b
9,1,12
84,2,13
3,3,14
2,4,15

In [None]:
# You can slice rows and columns:
# loc example: "Retrieve rows from index 84 to 3 and columns from 'a' to 'b'"
# iloc example: "Retrieve all rows from position 1 to position 3 exclusive and
# the columns from position 0 to position 2 exclusive"
#
# Note that, unlike the examples above, the output is a DataFrame and not a Series.
# This is because the output includes multiple rows and multiple columns.
print("loc:")
display(df.loc[84:3, 'a':'b'])
print()
print("iloc:")
display(df.iloc[1:3, 0:2])

loc:


,a,b
84,2,13
3,3,14



iloc:


,a,b
84,2,13
3,3,14
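
To make the note about output types concrete, here is a short sketch (an illustrative addition; the names `cell`, `column`, and `block` are hypothetical) showing how the type of the result depends on how many rows and columns are requested:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

cell = df.loc[9, 'a']          # one row, one column -> a single value
column = df.loc[84:3, 'a']     # several rows, one column -> Series
block = df.loc[84:3, 'a':'b']  # several rows, several columns -> DataFrame

print(cell)
print(type(column).__name__)
print(type(block).__name__)
```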


## DataFrames and math

In [66]:
# Everything we know from math on NumPy arrays applies to DataFrame and Series
# objects too.
df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])
df

,a,b
9,1,12
84,2,13
3,3,14
2,4,15


In [67]:
# Adding single values
df + 100

,a,b
9,101,112
84,102,113
3,103,114
2,104,115


In [68]:
# Broadcasting across rows
df + np.array([100, 1000])

,a,b
9,101,1012
84,102,1013
3,103,1014
2,104,1015
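
A NumPy array is broadcast by position. If a Series is used instead, Pandas matches its index against the DataFrame's column names. The sketch below is an illustrative addition (the name `offsets` is hypothetical); the Series is deliberately written out of column order to show that the matching is by name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

# Hypothetical offsets, keyed by column name and listed out of order on purpose.
offsets = pd.Series({'b': 1000, 'a': 100})

# Each offset is added to the column with the matching name.
result = df + offsets
print(result)
```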


In [69]:
df

,a,b
9,1,12
84,2,13
3,3,14
2,4,15


In [70]:
# Adding DataFrames
df + df

,a,b
9,2,24
84,4,26
3,6,28
2,8,30


In [71]:
# By default, aggregations are performed over rows, leaving only columns behind
df.mean()

a     2.5
b    13.5
dtype: float64

In [72]:
# We can specify an axis for aggregation. By default, it is 0 (rows):
df.mean(axis=0)

a     2.5
b    13.5
dtype: float64

In [73]:
# However, we can request that aggregations are performed across columns
# (axis 1), leaving only rows behind (axis 0):
df.mean(axis=1)

9     6.5
84    7.5
3     8.5
2     9.5
dtype: float64
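
The axis numbers mean the same thing as in NumPy. As a quick check (an illustrative sketch, not part of the original notebook; `values` is a name introduced here), we can convert the DataFrame to a plain array with `to_numpy()` and compare the results:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

values = df.to_numpy()  # the underlying data as a 2-dimensional NumPy array

# axis=0 collapses the rows: one result per column.
print(np.allclose(df.mean(axis=0), values.mean(axis=0)))

# axis=1 collapses the columns: one result per row.
print(np.allclose(df.mean(axis=1), values.mean(axis=1)))
```

The difference is that the Pandas results keep their labels: column names for axis 0, and row indices for axis 1.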

In [74]:
# Additionally, we can create new columns in the DataFrame by assigning new
# values to the column. This is commonly done during feature engineering, which
# is a process where we create new features from features already in the data:
df['c'] = df['a'] / df['b']
df

,a,b,c
9,1,12,0.083333
84,2,13,0.153846
3,3,14,0.214286
2,4,15,0.266667
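
Assigning to `df['c']` modifies the DataFrame in place. If you would rather leave the original untouched, `assign()` returns a new DataFrame with the extra column. This is an illustrative sketch of that alternative (the name `with_ratio` is hypothetical), not a replacement for the approach above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    [1, 12],
    [2, 13],
    [3, 14],
    [4, 15]
]), index=[9, 84, 3, 2], columns=['a', 'b'])

# assign() returns a copy with the new column; df itself is not modified.
with_ratio = df.assign(c=df['a'] / df['b'])

print(list(df.columns))          # original columns only
print(list(with_ratio.columns))  # original columns plus 'c'
```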


## Working with Real Data

This section uses the Life Satisfaction dataset. The dataset contains the life satisfaction score of several countries, an overall measurement of how satisfied each country's citizens are with their lives. It also includes each country's GDP per capita. Later in the class, we will discuss statistical techniques that allow us to determine whether GDP and life satisfaction scores are related (in fact, they are: in general, the higher the GDP, the higher the life satisfaction score).

For now, we have to load data into Pandas before we can begin exploring it. This section shows how to read data in CSV format.

In [None]:
# Pandas's strengths do not become apparent until you see how it works with
# DataFrames containing real data, not fake example data.
# To load real data into Pandas, use its read_csv() function.
#
# Note that this function works with files uploaded into Colab (or local files
# if you are using your own computer), or files from Web addresses that Pandas
# will download for you.
#
# Note that the .head() method causes Pandas to only show us the first 5 rows:
pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/life_satisfaction/life_satisfaction_header.csv").head()

,country,gdp_per_capita,life_satisfaction
0,Brazil,8669.998,6.4
1,Mexico,9009.28,6.5
2,Russia,9054.914,5.8
3,Turkey,9437.372,5.5
4,Poland,12495.334,6.1


In [None]:
# Note that this function expects the file to be a specially formatted plain text
# file. You can visit the URL and see the structure of the file.
#  - One row per line
#  - Columns are separated by commas (CSV: "comma-separated values")
#  - The first line contains column names
#
# However, CSV files are often messy and there is no accepted standard for them.
# Occasionally, you might find CSV files with no header. Notice how Pandas
# interprets the first row as the header anyway:
pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/life_satisfaction/life_satisfaction_noheader.csv").head()

,Brazil,8669.998,6.4
0,Mexico,9009.28,6.5
1,Russia,9054.914,5.8
2,Turkey,9437.372,5.5
3,Poland,12495.334,6.1
4,Latvia,13618.569,5.9


In [None]:
# We can fix this by specifying there is no header:
pd.read_csv(
    "https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/life_satisfaction/life_satisfaction_noheader.csv",
    header=None
).head()

,0,1,2
0,Brazil,8669.998,6.4
1,Mexico,9009.28,6.5
2,Russia,9054.914,5.8
3,Turkey,9437.372,5.5
4,Poland,12495.334,6.1
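
With `header=None`, the columns are simply numbered. The `names` argument of `read_csv()` lets us supply column names ourselves. The sketch below is an illustrative addition: to stay self-contained it reads from an in-memory string (via `io.StringIO`) instead of the URL, using the first three rows shown above, and the variable names `raw` and `df_named` are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-in for a headerless CSV file (first rows of the dataset above).
raw = io.StringIO(
    "Brazil,8669.998,6.4\n"
    "Mexico,9009.28,6.5\n"
    "Russia,9054.914,5.8\n"
)

# header=None: the first line is data. names=...: use these as the column names.
df_named = pd.read_csv(
    raw,
    header=None,
    names=["country", "gdp_per_capita", "life_satisfaction"],
)
print(df_named)
```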


In [None]:
# Some CSV files do not even use commas to separate columns. In this case, the file
# uses tab characters. Pandas expects comma separation, so it interprets the file as
# only having one column:
pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/life_satisfaction/life_satisfaction.tsv").head()

,country\tgdp_per_capita\tlife_satisfaction
0,Brazil\t8669.998\t6.4
1,Mexico\t9009.28\t6.5
2,Russia\t9054.914\t5.8
3,Turkey\t9437.372\t5.5
4,Poland\t12495.334\t6.1


In [None]:
# Again, we can fix this with arguments to read_csv. We can specify a custom separator:
pd.read_csv(
    "https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/life_satisfaction/life_satisfaction.tsv",
    sep='\t'
).head()

,country,gdp_per_capita,life_satisfaction
0,Brazil,8669.998,6.4
1,Mexico,9009.28,6.5
2,Russia,9054.914,5.8
3,Turkey,9437.372,5.5
4,Poland,12495.334,6.1
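
A quick way to notice a wrong separator is to check the shape of the result. The following sketch is an illustrative addition that uses a small in-memory sample (two rows copied from the output above) rather than the URL; the names `tsv_text`, `wrong`, and `right` are hypothetical.

```python
import io

import pandas as pd

# Small tab-separated sample held in memory instead of downloaded from a URL.
tsv_text = (
    "country\tgdp_per_capita\tlife_satisfaction\n"
    "Brazil\t8669.998\t6.4\n"
    "Mexico\t9009.28\t6.5\n"
)

wrong = pd.read_csv(io.StringIO(tsv_text))            # default separator is a comma
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tell Pandas to split on tabs

print(wrong.shape)  # a single column suggests the separator is wrong
print(right.shape)
```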


In [None]:
# Now that we have the data, we can try some more complex data queries:
df = pd.read_csv(
    "https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/life_satisfaction/life_satisfaction_header.csv",
)

In [None]:
# What is the UK's life satisfaction?
df[df['country'] == 'United Kingdom']

,country,gdp_per_capita,life_satisfaction
26,United Kingdom,43770.688,6.8
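
The query above returns a one-row DataFrame. If we want just the number, one possible approach is to combine the same comparison with loc to pick a single column, then take the first match by position. This is an illustrative sketch using a two-row in-memory sample (rows copied from the outputs in this notebook); the names `sample`, `uk_rows`, and `uk_score` are hypothetical.

```python
import pandas as pd

# Two rows copied from the dataset, with their original row indices.
sample = pd.DataFrame(
    {
        "country": ["Poland", "United Kingdom"],
        "gdp_per_capita": [12495.334, 43770.688],
        "life_satisfaction": [6.1, 6.8],
    },
    index=[4, 26],
)

# Rows where the comparison is True, restricted to one column -> Series.
uk_rows = sample.loc[sample["country"] == "United Kingdom", "life_satisfaction"]

# The row index is 26, so use iloc to take the first match by position.
uk_score = uk_rows.iloc[0]
print(uk_score)
```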


In [None]:
# Countries with a life satisfaction less than 6:
df[df['life_satisfaction'] < 6]

,country,gdp_per_capita,life_satisfaction
2,Russia,9054.914,5.8
3,Turkey,9437.372,5.5
5,Latvia,13618.569,5.9
6,Lithuania,14210.28,5.9
9,Estonia,17288.083,5.7
10,Greece,18064.288,5.4
11,Portugal,19121.592,5.4
12,Slovenia,20732.482,5.9
14,Korea,27195.197,5.9
16,Japan,32485.545,5.9
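
Queries like this work because the comparison itself produces a Series of True/False values, which is then used to pick rows. The sketch below is an illustrative addition built on the first five rows of the dataset; the 9100 GDP threshold and the names `sample`, `mask`, and `both` are hypothetical and chosen only for demonstration. Conditions can be combined with `&` (and) or `|` (or), and each condition needs its own parentheses.

```python
import pandas as pd

# First five rows of the dataset, entered by hand to keep the example self-contained.
sample = pd.DataFrame({
    "country": ["Brazil", "Mexico", "Russia", "Turkey", "Poland"],
    "gdp_per_capita": [8669.998, 9009.28, 9054.914, 9437.372, 12495.334],
    "life_satisfaction": [6.4, 6.5, 5.8, 5.5, 6.1],
})

# The comparison yields one True/False value per row.
mask = sample["life_satisfaction"] < 6
print(mask.tolist())

# Combine two conditions; the 9100 threshold is arbitrary.
both = sample[(sample["life_satisfaction"] < 6) & (sample["gdp_per_capita"] > 9100)]
print(both["country"].tolist())
```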


In [None]:
# What is the average life satisfaction score?
df['life_satisfaction'].mean()

6.4655172413793105