# Introduction to Pandas

### Series

In [1]:
import numpy as np
import pandas as pd

A series is like a one-dimensional NumPy array, except that every value carries a label; together the labels form the index. You can also treat a series as an ordered dictionary that maps labels to values.

In [3]:
series1 = pd.Series([3,6,9,12])
print(series1)

0     3
1     6
2     9
3    12
dtype: int64
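As a small sketch of both views of a series (the name `s` and the string labels are chosen only for illustration): arithmetic applies to every element, as with a NumPy array, while the labels allow dictionary-style lookup.

```python
import pandas as pd

# Same values as series1, but with explicit string labels (illustrative only)
s = pd.Series([3, 6, 9, 12], index=['a', 'b', 'c', 'd'])

# Array-like: arithmetic is applied to every element
doubled = s * 2

# Dictionary-like: look up a value by its label and list the labels
value_c = s['c']
labels = list(s.keys())

print(doubled)
print(value_c, labels)
```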


In [4]:
# to see just the values of a series
series1.values

array([ 3,  6,  9, 12], dtype=int64)

In [5]:
# to see just the indices
series1.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [8]:
# create a series of World War II casualties
ww2_casualities = pd.Series([8700000, 4300000, 3000000, 2100000, 400000], 
                            index = ['USSR', 'Germany', 'China', 'Japan', 'USA'])
print(ww2_casualities)

USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
dtype: int64


In [9]:
# see USA casualties
ww2_casualities['USA']

400000

In [11]:
# check which countries have casualties > 4 million
ww2_casualities[ww2_casualities > 4000000]

USSR       8700000
Germany    4300000
dtype: int64
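The filter above works because the comparison itself produces a boolean series, which is then used to select rows. A minimal sketch of that mechanism, using a fresh series named `casualties` (a name introduced here for illustration) with the same figures:

```python
import pandas as pd

casualties = pd.Series([8700000, 4300000, 3000000, 2100000, 400000],
                       index=['USSR', 'Germany', 'China', 'Japan', 'USA'])

# The comparison alone yields a boolean series aligned to the same index
mask = casualties > 4000000
print(mask)

# Combine conditions with & (and) or | (or), wrapping each one in parentheses
between = casualties[(casualties > 1000000) & (casualties < 4000000)]
print(between)
```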

In [12]:
# check if a label is in the index of a series (`in` tests labels, not values)
'USSR' in ww2_casualities

True
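Like a dictionary, a series answers `in` by looking at its labels, not its values. The sketch below, on a shortened series with illustrative variable names, shows two ways to test for a value instead.

```python
import pandas as pd

casualties = pd.Series([8700000, 400000], index=['USSR', 'USA'])

label_found = 'USSR' in casualties           # 'USSR' is an index label
value_as_label = 400000 in casualties        # 400000 is a value, not a label
value_found = 400000 in casualties.values    # search the underlying array
value_found_isin = casualties.isin([400000]).any()  # same check via isin()

print(label_found, value_as_label, value_found, value_found_isin)
```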

Since a series behaves much like a dictionary, you can convert a series into a dictionary and back.

In [15]:
# convert a series to a dictionary
ww2_dict = ww2_casualities.to_dict()
print(ww2_dict)

{'Germany': 4300000, 'USSR': 8700000, 'China': 3000000, 'USA': 400000, 'Japan': 2100000}


In [17]:
# convert a dictionary into a series
ww2_series = pd.Series(ww2_dict)
print(ww2_series)

China      3000000
Germany    4300000
Japan      2100000
USA         400000
USSR       8700000
dtype: int64


In [20]:
# use a list to order a new series
countries = ['China', 'Germany', 'Japan', 'USA', 'USSR', 'Argentina']
series2 = pd.Series(ww2_dict, index=countries)
print(series2)

China        3000000
Germany      4300000
Japan        2100000
USA           400000
USSR         8700000
Argentina        NaN
dtype: float64


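Since Argentina wasn't in our ww2_dict, its entry is filled with NaN. Note that the dtype also changed from int64 to float64, because NaN is a floating-point value.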
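The effect of a missing label on the dtype can be checked directly. This is a minimal sketch with a two-entry dictionary; the names `counts`, `complete`, and `with_gap` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

counts = {'China': 3000000, 'USA': 400000}

# Every requested label exists in the dictionary: the values stay integers
complete = pd.Series(counts, index=['China', 'USA'])

# 'Argentina' is missing: its entry becomes NaN and the series turns float
with_gap = pd.Series(counts, index=['China', 'USA', 'Argentina'])

print(complete.dtype)
print(with_gap.dtype)
print(with_gap['Argentina'])
```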

In [21]:
# find nulls
pd.isnull(series2)

China        False
Germany      False
Japan        False
USA          False
USSR         False
Argentina     True
dtype: bool

In [22]:
# find values that aren't null
pd.notnull(series2)

China         True
Germany       True
Japan         True
USA           True
USSR          True
Argentina    False
dtype: bool
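The boolean results of these checks are most useful as filters. A short sketch, assuming a small series `s` built here for illustration, of keeping, dropping, or replacing missing entries:

```python
import numpy as np
import pandas as pd

s = pd.Series([3000000.0, np.nan, 400000.0],
              index=['China', 'Argentina', 'USA'])

present = s[pd.notnull(s)]   # keep only the entries that have a value
dropped = s.dropna()         # shortcut that gives the same result
filled = s.fillna(0)         # or replace missing entries with a default

print(present)
print(filled)
```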

In [23]:
# we can add series together by indices
ww2_casualities + series2

Argentina         NaN
China         6000000
Germany       8600000
Japan         4200000
USA            800000
USSR         17400000
dtype: float64
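Addition matches values by label, and a label that is missing (or NaN) on either side yields NaN. If that is not what you want, the `add` method accepts a `fill_value`. A minimal sketch with two made-up series `a` and `b`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'w'])

plain = a + b                     # labels missing from either side give NaN
filled = a.add(b, fill_value=0)   # treat the missing side as 0 instead

print(plain)
print(filled)
```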

In [26]:
# We can name our series too
ww2_casualities.name = 'World War II Casualties'
print(ww2_casualities)

USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
Name: World War II Casualties, dtype: int64


In [27]:
# We can also name the index
ww2_casualities.index.name = 'Countries'
print(ww2_casualities)

Countries
USSR       8700000
Germany    4300000
China      3000000
Japan      2100000
USA         400000
Name: World War II Casualties, dtype: int64


### Data Frames

To start working with data frames, we need some data. We'll use the NFL win-loss records on Wikipedia. We can use the webbrowser module to open the page, then copy the column titles and the first five rows of data to the clipboard.

In [29]:
import webbrowser

In [30]:
website = 'http://en.wikipedia.org/wiki/NFL_win-loss_records'
webbrowser.open(website)

True

Pandas has a built-in way of reading data from the clipboard. Now that we've copied the data from Wikipedia, we just have to create a data frame from it. Note that when you do this, pandas automatically creates a row index (0, 1, 2, ...) for you.

In [34]:
nfl_frame = pd.read_clipboard()
nfl_frame

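|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North |
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North |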
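The clipboard step depends on what you happened to copy, so it is hard to reproduce. As an alternative sketch, the same five rows can be parsed from an in-memory string with `pd.read_csv`; the comma-separated layout of `raw` is an assumption made for this example, not the format Wikipedia uses.

```python
import io

import pandas as pd

# The five rows shown above, typed out as comma-separated text
raw = """Rank,Team,Won,Lost,Tied*,Pct.,First Season,Total Games,Conference
1,Dallas Cowboys,511,378,6,0.574,1960,894,NFC East
2,Chicago Bears,752,563,42,0.57,1920,1357,NFC North
3,Green Bay Packers,741,561,37,0.567,1921,1339,NFC North
4,Miami Dolphins,443,345,4,0.562,1966,792,AFC East
5,Baltimore Ravens,182,143,1,0.56,1996,326,AFC North
"""

# StringIO lets read_csv treat the string as if it were a file
nfl_frame = pd.read_csv(io.StringIO(raw))
print(nfl_frame)
```

As with the clipboard version, pandas adds the 0 to 4 row index on its own.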


In [36]:
# see column names
nfl_frame.columns

Index(['Rank', 'Team', 'Won', 'Lost', 'Tied*', 'Pct.', 'First Season',
       'Total Games', 'Conference'],
      dtype='object')

Using a column name, you can retrieve that column as a series, which holds both the index and the column's data.

In [37]:
# pull team column
nfl_frame.Team

0       Dallas Cowboys
1        Chicago Bears
2    Green Bay Packers
3       Miami Dolphins
4     Baltimore Ravens
Name: Team, dtype: object

However, note that some column names contain a space. Attribute access doesn't work for these columns, so we have to look them up the way we would a dictionary key.

In [43]:
# pull first season column
nfl_frame['First Season']

0    1960
1    1920
2    1921
3    1966
4    1996
Name: First Season, dtype: int64
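The two access styles can be compared on a small frame. In this sketch the frames `frame` and `tally` are made up for illustration; the second one shows another case where attribute access is unsuitable, namely a column whose name collides with a DataFrame method.

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Dallas Cowboys', 'Chicago Bears'],
                      'First Season': [1960, 1920]})

teams = frame.Team                          # attribute access: simple names only
seasons = frame['First Season']             # brackets work for any column name
subset = frame[['Team', 'First Season']]    # a list of names gives a data frame

# A column called 'count' is hidden by the DataFrame.count method
tally = pd.DataFrame({'count': [1, 2]})
attribute_is_method = callable(tally.count)   # the method, not the column
count_column = tally['count']                 # brackets still reach the column

print(seasons)
print(attribute_is_method)
```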

We can create a new data frame from our existing frame. 

In [45]:
pd.DataFrame(nfl_frame, columns = ['Team', 'First Season', 'Total Games'])

|   | Team | First Season | Total Games |
|---|------|--------------|-------------|
| 0 | Dallas Cowboys | 1960 | 894 |
| 1 | Chicago Bears | 1920 | 1357 |
| 2 | Green Bay Packers | 1921 | 1339 |
| 3 | Miami Dolphins | 1966 | 792 |
| 4 | Baltimore Ravens | 1996 | 326 |


Now, what happens if we ask for a column that doesn't exist in the original data frame? When you build a new frame this way, pandas won't raise an error; it just fills that column with NaN values. (Indexing the existing frame directly, as in nfl_frame['Stadium'], would raise a KeyError.)

In [46]:
pd.DataFrame(nfl_frame, columns = ['Team', 'First Season', 
                                   'Total Games', 'Stadium'])

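|   | Team | First Season | Total Games | Stadium |
|---|------|--------------|-------------|---------|
| 0 | Dallas Cowboys | 1960 | 894 | NaN |
| 1 | Chicago Bears | 1920 | 1357 | NaN |
| 2 | Green Bay Packers | 1921 | 1339 | NaN |
| 3 | Miami Dolphins | 1966 | 792 | NaN |
| 4 | Baltimore Ravens | 1996 | 326 | NaN |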
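The NaN fill only happens when you describe the column layout of a new frame. The sketch below uses `reindex(columns=...)`, a different call from the constructor used above that produces the same kind of result, and contrasts it with plain indexing on a small made-up frame.

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Dallas Cowboys', 'Chicago Bears'],
                      'Total Games': [894, 1357]})

# Requesting a layout that includes an unknown column fills it with NaN
wider = frame.reindex(columns=['Team', 'Total Games', 'Stadium'])

# Indexing the existing frame by an unknown name raises KeyError instead
try:
    frame['Stadium']
    raised = False
except KeyError:
    raised = True

print(wider)
print(raised)
```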


Now, let's look at how we can retrieve rows from our data frame.

To see the first few rows of a data frame, we can use the head(n) method, where n is the number of rows you want to see; n defaults to 5. You can see the last few rows with the tail(n) method, which has the same default.

In [49]:
# see the top 3 rows
nfl_frame.head(3)

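|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North |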


In [50]:
# see the last 2 rows
nfl_frame.tail(2)

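|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North |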
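Since our frame only has five rows, the default row count is easier to see on a longer one. A quick sketch on a ten-row frame made up for this purpose:

```python
import pandas as pd

frame = pd.DataFrame({'n': range(10)})

first_five = frame.head()    # no argument: the first 5 rows
last_five = frame.tail()     # no argument: the last 5 rows
last_two = frame.tail(2)     # explicit row count

print(first_five)
print(last_two)
```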


In [51]:
# to retrieve a row by its index label, use the .loc indexer
# (the older .ix indexer was removed in pandas 1.0)
nfl_frame.loc[3]

Rank                         4
Team            Miami Dolphins
Won                        443
Lost                       345
Tied*                        4
Pct.                     0.562
First Season              1966
Total Games                792
Conference            AFC East
Name: 3, dtype: object
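With the default 0, 1, 2, ... index, a row's label and its position are the same number, which hides the difference between looking up by label and by position. The sketch below uses a made-up frame with labels 10, 20, 30 to separate the two: `.loc` selects by label and `.iloc` by position.

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['Dallas Cowboys', 'Chicago Bears',
                               'Green Bay Packers'],
                      'Won': [511, 752, 741]},
                     index=[10, 20, 30])

by_label = frame.loc[20]       # the row whose index label is 20
by_position = frame.iloc[0]    # the first row, whatever its label
cell = frame.loc[30, 'Won']    # a single cell: row label, then column name

print(by_label)
print(by_position)
print(cell)
```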

We can also assign values to a whole column. For example, let's add a stadium column; assigning a single value fills every row with it.

In [54]:
nfl_frame['Stadium'] = 'Test Stadium'
nfl_frame

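|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference | Stadium |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|---------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East | Test Stadium |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North | Test Stadium |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North | Test Stadium |
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East | Test Stadium |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North | Test Stadium |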


We can also assign a list, as long as it has one item per row.

In [57]:
nfl_frame['Stadium'] = ['I', "Don't", 'Know', 'Any', 'Stadiums']
nfl_frame

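|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference | Stadium |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|---------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East | I |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North | Don't |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North | Know |
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East | Any |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North | Stadiums |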


Or even a NumPy array.

In [59]:
nfl_frame['Stadium'] = np.arange(5)
nfl_frame

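|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference | Stadium |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|---------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East | 0 |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North | 1 |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North | 2 |
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East | 3 |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North | 4 |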


We can even do this using a series. One benefit of assigning a series is index matching: pandas uses the index of the series to assign each value to the row with the matching index, and rows with no match get NaN.

In [61]:
stadiums = pd.Series(["Levi's Stadium", "AT&T Stadium"], index = [4, 0])

In [69]:
nfl_frame['Stadium'] = stadiums
nfl_frame

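|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference | Stadium |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|---------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East | AT&T Stadium |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North | NaN |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North | NaN |
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East | NaN |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North | Levi's Stadium |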
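The difference between assigning a series and assigning a list shows up when the lengths don't match. In this sketch (the frame, the team letters, and the stadium names are placeholders), a short series is aligned by label, while a short list is rejected.

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['A', 'B', 'C']})

# A series is aligned by index label; rows without a match become NaN
frame['Stadium'] = pd.Series(['Stadium C', 'Stadium A'], index=[2, 0])

# A list carries no labels, so its length must equal the number of rows
try:
    frame['Capacity'] = [50000, 60000]
    raised = False
except ValueError:
    raised = True

print(frame)
print(raised)
```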


We can even delete columns that we don't want.

In [70]:
# delete the stadium column
del nfl_frame['Stadium']

In [71]:
nfl_frame

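|   | Rank | Team | Won | Lost | Tied* | Pct. | First Season | Total Games | Conference |
|---|------|------|-----|------|-------|------|--------------|-------------|------------|
| 0 | 1 | Dallas Cowboys | 511 | 378 | 6 | 0.574 | 1960 | 894 | NFC East |
| 1 | 2 | Chicago Bears | 752 | 563 | 42 | 0.57 | 1920 | 1357 | NFC North |
| 2 | 3 | Green Bay Packers | 741 | 561 | 37 | 0.567 | 1921 | 1339 | NFC North |
| 3 | 4 | Miami Dolphins | 443 | 345 | 4 | 0.562 | 1966 | 792 | AFC East |
| 4 | 5 | Baltimore Ravens | 182 | 143 | 1 | 0.56 | 1996 | 326 | AFC North |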
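`del` changes the frame in place. If you would rather keep the original untouched, one option is the `drop` method, which returns a new frame. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

frame = pd.DataFrame({'Team': ['A', 'B'], 'Stadium': ['X', 'Y']})

# drop() returns a new frame and leaves the original as it was
trimmed = frame.drop(columns=['Stadium'])

print(trimmed.columns.tolist())
print(frame.columns.tolist())
```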


Now, let's construct another data frame from a Python dictionary. Pandas can create a data frame directly from a dictionary, as long as the lists in the dictionary all have the same length.

In [73]:
# create a dictionary
data = {'City':['SF', 'LA', 'NYC'], 'Population':[837000, 3880000, 8400000]}

In [76]:
# create a data frame from the data dictionary
city_frame = pd.DataFrame(data)
city_frame

|   | City | Population |
|---|------|------------|
| 0 | SF | 837000 |
| 1 | LA | 3880000 |
| 2 | NYC | 8400000 |
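To see what "lengths match up" means in practice, the sketch below passes lists of unequal length, which pandas rejects, and then a single scalar, which is repeated for every row. The `Country` column is made up for illustration.

```python
import pandas as pd

# Lists of different lengths cannot be lined up into rows
try:
    pd.DataFrame({'City': ['SF', 'LA', 'NYC'],
                  'Population': [837000, 3880000]})
    raised = False
except ValueError:
    raised = True

# A scalar next to a list is broadcast to every row
frame = pd.DataFrame({'City': ['SF', 'LA', 'NYC'], 'Country': 'USA'})

print(raised)
print(frame)
```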


It's really useful to take a look at the pandas documentation to learn more about how to import data and what we can do with it once it's in pandas. 

http://pandas-docs.github.io/pandas-docs-travis/