This tutorial was written by [Won Hee Lee](https://wonhee-lee.github.io/) for SWCON425.

This current version has been created as a Jupyter notebook with Python3 for SWCON425, Data Science and Visualization.

# Pandas

## Some useful (free) resources

Introductory:

* [Getting started with Python for research](https://github.com/TiesdeKok/LearnPythonforResearch), a gentle introduction to Python in data-intensive research.

* [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/index.html), by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:

* [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), by Jake VanderPlas.

* [Python for Data Analysis, 2nd Edition](https://www.oreilly.com/library/view/python-for-data/9781491957653/), by  Wes McKinney, creator of Pandas. [Companion Notebooks](https://github.com/wesm/pydata-book)

* [Effective Pandas](https://github.com/TomAugspurger/effective-pandas), a book by Tom Augspurger, core Pandas developer.

OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib styles [here](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html)).


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)

# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

## Reading in DataFrames from Files

Pandas provides a family of file-reading functions whose names start with `read_`. You can see them enumerated by typing `pd.re` and pressing Tab. We'll be using `read_csv` today.

In [2]:
# Write the code to read "elections.csv" file.

elections = pd.read_csv('elections.csv')
elections # if we end a cell with an expression or variable name, the result will print

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
...,...,...,...,...,...
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win
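
If `elections.csv` is not at hand, the same call can be tried on an in-memory file. The sketch below assumes a three-row excerpt of the table above typed in as a string; `csv_text` and `sample` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# A three-row excerpt of the elections table, typed by hand for illustration.
csv_text = """Candidate,Party,%,Year,Result
Reagan,Republican,50.7,1980,win
Carter,Democratic,41.0,1980,loss
Anderson,Independent,6.6,1980,loss
"""

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # pandas infers a dtype for each column
```

Since no index column was requested, pandas numbers the rows 0, 1, 2, which is where the leftmost column in the output above comes from.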


In [3]:
# write the code to print shape of 'elections'

elections.shape

(23, 5)

In [4]:
# write the code to print size of 'elections'

elections.size

115
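
The two results are related: `size` is the number of rows times the number of columns (23 × 5 = 115 above). A minimal sketch on a small hand-built DataFrame (the name `sample` and its contents are only for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
    "%": [50.7, 41.0, 6.6],
})

n_rows, n_cols = sample.shape  # shape is a (rows, columns) tuple
print(sample.shape)            # (3, 3)
print(sample.size)             # 9, the total number of cells
print(len(sample))             # 3, len() counts rows only
```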

We can use the head method to show only the first few rows of a DataFrame (five by default).

In [7]:
# write the code to print first five rows of 'elections'

elections.head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


There is also a tail method, which shows the last few rows.

In [8]:
# write the code to print last five rows of 'elections'

elections.tail()

Unnamed: 0,Candidate,Party,%,Year,Result
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


The read_csv command lets us specify a column to use as the index. For example, we could have used Year as the index.

In [9]:
# Write the code to read "elections.csv" file while setting the "Year" column as the index

elections_year_index = pd.read_csv('elections.csv', index_col='Year')
elections_year_index.tail(5)

Year,Candidate,Party,%,Result
2008,McCain,Republican,45.7,loss
2012,Obama,Democratic,51.1,win
2012,Romney,Republican,47.2,loss
2016,Clinton,Democratic,48.2,loss
2016,Trump,Republican,46.1,win


Alternatively, we could have used the set_index command.

In [15]:
# write the code to print 'elections' that has "Party" as the index

elections_party_index = elections.set_index('Party')
elections_party_index

Party,Candidate,%,Year,Result
Republican,Reagan,50.7,1980,win
Democratic,Carter,41.0,1980,loss
Independent,Anderson,6.6,1980,loss
...,...,...,...,...
Republican,Romney,47.2,2012,loss
Democratic,Clinton,48.2,2016,loss
Republican,Trump,46.1,2016,win


In [11]:
# write the code to set "Party" as the index

elections_party_index = elections.set_index('Party')
elections_party_index.head(5)

Party,Candidate,%,Year,Result
Republican,Reagan,50.7,1980,win
Democratic,Carter,41.0,1980,loss
Independent,Anderson,6.6,1980,loss
Republican,Reagan,58.8,1984,win
Democratic,Mondale,37.6,1984,loss


Index labels do not have to be unique, as the Party index above shows. By contrast, column names are ideally unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates.

In [16]:
# open the "duplicate_columns.csv" and check the column names
# compare the opened file and printed dataframe. what's different?

dups = pd.read_csv("duplicate_columns.csv")
dups

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,
3,hong,gildong,banana


### Tips

In [21]:
pd.read_csv("duplicate_columns.csv", header=None)

Unnamed: 0,0,1,2
0,name,name,flavor
1,john,smith,vanilla
2,zhang,shan,chocolate
3,fulan,alfulani,
4,hong,gildong,banana


## The [] Operator

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If you provide a string to the [] operator, you get back a Series corresponding to the requested label.

In [17]:
elections

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
...,...,...,...,...,...
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


In [23]:
# write the code to print first six "Candidate" in 'elections' using [ ] operation

elections['Candidate'][:6]

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
5        Bush
Name: Candidate, dtype: object

The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [26]:
# write the code to print first six "Candidate" & "Party" columns in 'elections' using [ ] operation
# you can pass a list of strings for indexing

elections[['Candidate', 'Party']][:6]

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent
3,Reagan,Republican
4,Mondale,Democratic
5,Bush,Republican


In [28]:
#var_name = 'Candidate'  # if single string without [], then we get series
var_name = ['Candidate'] # if List, then we get dataframe
elections[var_name]
#elections[['Candidate']]

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
...,...
20,Romney
21,Clinton
22,Trump


A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

In [29]:
elections[["Candidate"]].head(10)

# compare with the Series version:
# elections['Candidate'].head(10)

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
...,...
7,Clinton
8,Bush
9,Perot


Note that we can also use the to_frame method to turn a Series into a DataFrame.

In [30]:
#elections["Candidate"] # returns series so 
# write the code to convert series to DataFrame with pandas method

elections["Candidate"].to_frame()

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
...,...
20,Romney
21,Clinton
22,Trump
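
To make the Series/DataFrame distinction concrete, here is a small check of the three forms side by side. The DataFrame `sample` is a hand-built stand-in for `elections`.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
})

as_series = sample["Candidate"]            # single label -> Series
as_frame = sample[["Candidate"]]           # list of labels -> DataFrame
converted = sample["Candidate"].to_frame() # Series -> one-column DataFrame

print(type(as_series).__name__)    # Series
print(type(as_frame).__name__)     # DataFrame
print(as_frame.equals(converted))  # True
```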


The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

In [31]:
elections[0:3]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss


If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer. 

In [32]:
elections[0]  # this does not work: it raises a KeyError
# Why?
# When [] is given a single item (here 0), it treats that item as a column label.
# It looks for a column labeled 0.
# There is no such column in this DataFrame, so pandas raises a KeyError.

KeyError: 0
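
A minimal sketch of the same failure, caught with `try`/`except` so the cell keeps running, followed by two ways to get the first row instead. `sample` is a hand-built stand-in for `elections`, and `iloc` is covered later in this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
})

raised = False
try:
    sample[0]  # a single item is looked up as a column label
except KeyError as err:
    raised = True
    print("KeyError:", err)

first_row_frame = sample[0:1]  # a slice selects rows -> one-row DataFrame
first_row = sample.iloc[0]     # position-based access -> Series for that row

print(first_row_frame)
print(first_row)
```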

In [33]:
# Write the code to print first three rows of "Candidate" & "Party" columns in 'elections' without using head() method

elections[['Candidate', 'Party']][:3]

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent


### Back to slides

The following cells allow you to test your understanding.

In [34]:
weird = pd.DataFrame({1:["topdog","botdog"], "1":["topcat","botcat"]})
weird

Unnamed: 0,1,1
0,topdog,topcat
1,botdog,botcat


In [35]:
weird[1] #try to predict the output 

0    topdog
1    botdog
Name: 1, dtype: object

In [36]:
weird["1"] #try to predict the output

0    topcat
1    botcat
Name: 1, dtype: object

In [37]:
weird[["1"]] #try to predict the output

Unnamed: 0,1
0,topcat
1,botcat


In [38]:
weird[1:] #try to predict the output

Unnamed: 0,1,1
1,botdog,botcat


## Boolean Array Selection

The `[]` operator also supports an array of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [39]:
elections[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win
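
The length requirement can be seen on a smaller table. This is a sketch on a four-row hand-built sample; the exact exception type for a wrong-length mask is an assumption that may vary between pandas versions, so the code catches both `ValueError` and `IndexError`.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Year": [1980, 1980, 1980, 1984],
    "Result": ["win", "loss", "loss", "win"],
})

mask = [True, False, False, True]  # one boolean per row
kept = sample[mask]
print(kept)

wrong_length_failed = False
try:
    sample[[True, False]]  # only two booleans for four rows
except (ValueError, IndexError) as err:
    wrong_length_failed = True
    print(type(err).__name__, err)
```

Note that the surviving rows keep their original index labels (0 and 3 here), just as rows 7, 10, 14, and 22 did above.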


In [42]:
#elections['Result']
elections['%']

0     50.7
1     41.0
2      6.6
      ... 
20    47.2
21    48.2
22    46.1
Name: %, Length: 23, dtype: float64

In [43]:
elections['%'] + 10    # like NumPy, we can do arithmetic

0     60.7
1     51.0
2     16.6
      ... 
20    57.2
21    58.2
22    56.1
Name: %, Length: 23, dtype: float64

One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that comparison operators such as the equality operator can be applied to a Pandas Series to generate a Boolean Array. For example, we can compare the 'Result' column to the String 'win':

In [44]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [45]:
# like NumPy, we can do Boolean operations on series
# write the code to test whether "Result" is 'win' for each row of 'elections' and save the result

iswin = elections['Result'] == 'win'
iswin  # only the last expression in a cell is displayed

0      True
1     False
2     False
      ...  
20    False
21    False
22     True
Name: Result, Length: 23, dtype: bool

In [46]:
# write the code to print the rows of 'elections' where "Result" is 'win', using the iswin variable defined in the cell above

elections[iswin]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
...,...,...,...,...,...
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry with index i represents the result of the application of that operator to the entry of the original Series with index i.
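
These properties are easy to verify directly. A minimal sketch on a hand-built sample (`sample` and `is_win` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Result": ["win", "loss", "loss", "win"],
})

is_win = sample["Result"] == "win"

print(is_win.name)                        # Result
print(is_win.dtype)                       # bool
print(is_win.index.equals(sample.index))  # True
print(is_win.sum())                       # 2, since True counts as 1
```

Summing a boolean Series is a quick way to count how many rows satisfy a condition before filtering on it.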

In [47]:
# write the code to print the rows of 'elections' where "Party" is 'Independent', using a single line of code

elections[elections['Party'] == 'Independent']

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
9,Perot,Independent,18.9,1992,loss
12,Perot,Independent,8.4,1996,loss


In [48]:
elections['Result'].head(5)

0     win
1    loss
2    loss
3     win
4    loss
Name: Result, dtype: object

In [49]:
elections.loc[iswin]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
...,...,...,...,...,...
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

We can filter on multiple criteria by creating multiple boolean Series and combining them using the `&` operator.

In [50]:
elections[(elections['Result'] == 'win')]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
...,...,...,...,...,...
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


In [51]:
elections[(elections['%'] < 50)]    

Unnamed: 0,Candidate,Party,%,Year,Result
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
4,Mondale,Democratic,37.6,1984,loss
...,...,...,...,...,...
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


In [52]:
# '&' = 'and' = ampersand    '|' = 'or' = pipe
# write the code to print the rows of 'elections' for election winners with under 50%

elections[(elections['%'] < 50) & (elections['Result'] == 'win')]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win
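
The parentheses around each comparison matter: in Python, `&` and `|` bind more tightly than `<` and `==`, so leaving them out changes how the expression is grouped. The sketch below, on four rows copied by hand from the table, shows `&` (and), `|` (or), and `~` (not); naming the intermediate masks is just one way to keep the line readable.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Clinton", "Perot"],
    "%": [50.7, 41.0, 43.0, 18.9],
    "Year": [1980, 1980, 1992, 1992],
    "Result": ["win", "loss", "win", "loss"],
})

under_50 = sample["%"] < 50
won = sample["Result"] == "win"

both = sample[under_50 & won]    # winners with under 50%
either = sample[under_50 | won]  # under 50%, or a winner, or both
losers = sample[~won]            # ~ negates the mask

print(both["Candidate"].tolist())    # ['Clinton']
print(either["Candidate"].tolist())  # ['Reagan', 'Carter', 'Clinton', 'Perot']
print(losers["Candidate"].tolist())  # ['Carter', 'Perot']
```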


## loc and iloc

The loc accessor selects rows and columns by label, using the form `df.loc[rows, columns]`.

In [55]:
row = 2
col = 'Candidate'
elections.loc[row, col]

'Anderson'

In [56]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [57]:
#elections.loc[[0, 1, 2, 3, 4],:]
elections.loc[[0, 1, 2, 3, 4], ['Candidate']]

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
3,Reagan
4,Mondale


Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.

In [58]:
elections.loc[0:4, 'Candidate']  # loc, always right inclusive

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

In [67]:
elections['Candidate'][0:5]   # numeric slices, right exclusive

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object
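
The two cells above return the same five entries even though one slice ends at 4 and the other at 5. A small sketch of that comparison, using a six-row hand-built sample with the default 0 to 5 index:

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
})

by_label = sample.loc[0:4, "Candidate"]  # label slice: end label 4 is included
by_position = sample["Candidate"][0:5]   # positional slice: end position 5 is excluded

print(by_label.equals(by_position))            # True
print(len(sample.loc[0:4]), len(sample[0:4]))  # 5 4
```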

In [60]:
elections.loc[0:6, 'Candidate':'Party']

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent
3,Reagan,Republican
4,Mondale,Democratic
5,Bush,Republican
6,Dukakis,Democratic


If we provide only a single label for the column argument, we get back a Series.

In [61]:
elections.loc[0:4, 'Candidate']

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provide a list containing the column name.

In [62]:
elections.loc[0:4, ['Candidate']] # provide a list

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
3,Reagan
4,Mondale


If we give only one row but many column labels, we'll get back a Series corresponding to a row of the table. This new Series has a neat index, where each entry is the name of the column that the data came from.

In [64]:
elections.loc[0, 'Candidate':'Year']

Candidate        Reagan
Party        Republican
%                    51
Year               1980
Name: 0, dtype: object
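
A short sketch of what such a row Series looks like from the inside, on a two-row hand-built sample. Because the row mixes text and numbers, pandas stores it with the generic `object` dtype.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter"],
    "Party": ["Republican", "Democratic"],
    "%": [50.7, 41.0],
    "Year": [1980, 1980],
    "Result": ["win", "loss"],
})

row = sample.loc[0, "Candidate":"Year"]  # one row label, a slice of column labels

print(type(row).__name__)  # Series
print(list(row.index))     # ['Candidate', 'Party', '%', 'Year']
print(row.name)            # 0, the label of the row it came from
print(row["Party"])        # Republican
```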

In [65]:
elections.loc[[0], 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980


In [68]:
#elections.loc[:,:] # meaning all the rows and columns
elections.loc[1,:] 

Candidate        Carter
Party        Democratic
%                    41
Year               1980
Result             loss
Name: 1, dtype: object

If we omit the column argument altogether, the default behavior is to retrieve all columns. 

In [69]:
elections.loc[[2, 4, 5]]

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win


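Loc also supports boolean array inputs instead of labels. In recent versions of pandas, a boolean array must be exactly as long as the axis it selects from; a shorter array raises an error instead of being padded with False. The cell below picks rows 0 and 3 and the columns Candidate and Year using lists of labels.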

In [70]:
elections.loc[[0, 3], ['Candidate', 'Year']]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984
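
For comparison, here is a sketch of the same kind of selection written with boolean lists, on a four-row, four-column hand-built sample. It assumes a recent pandas version, where a mask of the wrong length raises an error; the exact exception type is not guaranteed here, so both `IndexError` and `ValueError` are caught.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Party": ["Republican", "Democratic", "Independent", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8],
    "Year": [1980, 1980, 1980, 1984],
})

row_mask = [True, False, False, True]  # one boolean per row
col_mask = [True, False, False, True]  # one boolean per column

picked = sample.loc[row_mask, col_mask]
by_label = sample.loc[[0, 3], ["Candidate", "Year"]]
print(picked.equals(by_label))  # True

short_mask_failed = False
try:
    sample.loc[[True, False]]  # too short for four rows
except (IndexError, ValueError) as err:
    short_mask_failed = True
    print(type(err).__name__, err)
```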


Boolean Series are also boolean arrays, so we can use the Boolean Array Selection from earlier using loc as well.

In [71]:
# write the code to print the "Candidate" through "Year" columns of 'elections' for election winners with under 50%, using loc

elections.loc[(elections['%'] < 50) & (elections['Result'] == 'win'), 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
7,Clinton,Democratic,43.0,1992
10,Clinton,Democratic,49.2,1996
14,Bush,Republican,47.9,2000
22,Trump,Republican,46.1,2016


### iloc

loc's cousin iloc is very similar, but selects by integer position instead of by label. For example, to access the first 3 rows and first 3 columns of a table, we can use iloc[0:3, 0:3]. iloc slicing is **exclusive** of the end position, just like standard Python slicing of numerical values.

In [72]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [73]:
elections.iloc[0:3, 0:3]  # exclusive, whereas loc is inclusive

Unnamed: 0,Candidate,Party,%
0,Reagan,Republican,50.7
1,Carter,Democratic,41.0
2,Anderson,Independent,6.6
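
With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between loc and iloc. The sketch below uses a hypothetical index of 10, 20, 30, 40 (not from the elections data) so that the two can be told apart.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
        "Party": ["Republican", "Democratic", "Independent", "Republican"],
        "Year": [1980, 1980, 1980, 1984],
    },
    index=[10, 20, 30, 40],  # hypothetical labels, chosen to differ from positions
)

by_position = sample.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = sample.loc[10:30]     # labels 10 through 30; label 30 is included
corner = sample.iloc[0:3, 0:2]   # first 3 rows, first 2 columns

print(list(by_position.index))  # [10, 20]
print(list(by_label.index))     # [10, 20, 30]
print(corner.shape)             # (3, 2)
```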


What we've done so far is NOT exploratory data analysis. We were just playing around a bit with the capabilities of the pandas library. 