# pandas lesson 

Objective: Wrangle data with the pandas package. Create a dataset, add new columns, select columns, filter rows, group and aggregate.

 A pandas DataFrame is a 2D labelled object (a table).
 A pandas Series is a 1D labelled object.
 Rows in pandas have labels, called the index (a strange concept to SQL people). The labels are normally unique, although pandas does not enforce this.
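
As a small illustration of these three points, the sketch below builds a toy dataframe (the column names, values and row labels are made up for this example) and checks its dimensions and index.

```python
import pandas as pd

# Hypothetical toy data, not part of the lesson's dataset
df = pd.DataFrame(
    {"price": [1.5, 2.0, 0.5], "volume": [10, 20, 30]},
    index=["a", "b", "c"],
)

col = df["price"]  # selecting a single column gives a Series

print(type(df).__name__, df.ndim)    # DataFrame 2
print(type(col).__name__, col.ndim)  # Series 1
print(list(df.index))                # the row labels
print(df.index.is_unique)            # whether the row labels are unique
```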

 This lesson shows examples of typical operations on a pandas dataframe including:
* calculating new columns
* filtering rows in various ways
* grouping and summarising 
* using the apply function and lambda functions
* creating simple plots

In [78]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # pandas uses matplotlib for plotting

### pandas Series

Create a series

In [79]:

p = pd.Series(np.random.randn(4), name = 'price')
p

0   -0.114905
1   -0.367645
2    1.006557
3   -1.397824
Name: price, dtype: float64

Exercise: examine the series.  You may want to try these properties and methods: p.index, p.values, p.dtype, p.shape, p.ndim, p.size

In [80]:
# Write your code here

Perform an element-wise operation e.g. multiply all values by 100

In [81]:
p * 100

0    -11.490502
1    -36.764507
2    100.655700
3   -139.782370
Name: price, dtype: float64

Exercise: add 10 to each value in the Series.

In [82]:
# Write your code here

We can aggregate the values in a Series (e.g. sum, average) using either a numpy function or a method on the Series.

In [83]:
np.sum(p) # total value of all elements

-0.87381677934117

In [84]:
p.sum() 

-0.87381677934117

Exercise: Find the min, max, median and other summary statistics of the Series

In [85]:
# Write your code here


A Series and a DataFrame have a (row) index; by default this is 0, 1, 2, ... We can change this index. This is useful, for example, in time series, where a date may be the index.

In [86]:
p.index = ["HSBC", "BP", "TSCO", "RDSA"] 
p

HSBC   -0.114905
BP     -0.367645
TSCO    1.006557
RDSA   -1.397824
Name: price, dtype: float64
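
To illustrate the time-series case mentioned above, here is a minimal sketch with a date index. The dates, the values and the name `close` are invented for the example.

```python
import pandas as pd

# Hypothetical daily closing prices; dates and values are made up
dates = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.Series([101.0, 102.5, 99.8, 100.4], index=dates, name="close")

# Look up one day by its date label
print(close.loc[pd.Timestamp("2024-01-03")])

# Slice by date labels; label slices include both end points
print(close.loc["2024-01-02":"2024-01-03"])
```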

We can use the index to get or set the corresponding value in the Series

In [87]:
p['BP'] # get a value in the Series


-0.3676450666491596

In [88]:
p['BP'] = 5.0 # set a value in the Series
p['BP']

5.0

We can use *in* to see if the index contains a particular key

In [89]:
'TSCO' in p 

True
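
Note that *in* looks at the index labels, not at the values. The sketch below, which uses a small made-up Series `s`, shows the difference and two ways to test for a value instead.

```python
import pandas as pd

# Hypothetical prices, labelled by ticker
s = pd.Series([1.0, 5.0, 2.5], index=["HSBC", "BP", "TSCO"], name="price")

print("BP" in s)             # True: 'BP' is an index label
print(5.0 in s)              # False: 5.0 is a value, not a label
print(5.0 in s.values)       # True: search the underlying values
print(s.isin([5.0]).any())   # True: the same check with a pandas method
```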

### pandas Dataframes

We can build a pandas dataframe from a dict. The dict's keys become the column names and the dict's values become the column values.

In [90]:
fb_dict = {
        'id' : ['MCY', 'LIV', 'TOT', 'CHE', 'ARL'],
        'city' :	['Manchester',	'Liverpool', 'London', 'London', 'London'],
        'team' :	['Manchester City', 'Liverpool', 'Tottenham Hotspur', 'Chelsea', 'Arsenal'],
        'champions_league' : ['Yes', 'Yes', 'No', 'No', 'Yes'],
        'won' :	[5, 6, 6, 5, 5],
        'drawn' : [4, 1, 0, 2,0],
        'lost' : [0, 0, 2, 0, 2],
        'form' : ['DWWWW', 'WWWWD', 'LLWWW', 'WWWDD', 'WWWWW']
        }

fb_df = pd.DataFrame(fb_dict)

# set the index to the unique values of the 'id' column - more useful than 0,1,2...
fb = fb_df.set_index('id')
fb

id,city,team,champions_league,won,drawn,lost,form
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW
CHE,London,Chelsea,No,5,2,0,WWWDD
ARL,London,Arsenal,Yes,5,0,2,WWWWW


Exercise: Examine the dataframe.  Use the following dataframe properties and methods: index, head(), describe(), shape, values, columns

In [91]:
# Write your code here

We can access a column or columns (col1, col2) in a dataframe df using:
* df['col1'] which returns a series
* df.col1 which returns a series
* df[['col1']] or df[['col1', 'col2']] which return dataframes
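
The sketch below checks the type returned by each form on a toy dataframe (the names `col1` and `col2` and the values are placeholders, as in the list above).

```python
import pandas as pd

# Hypothetical toy dataframe
df = pd.DataFrame({"col1": [1, 2], "col2": ["x", "y"]})

a = df["col1"]            # Series
b = df.col1               # Series; only works if the name is a valid Python identifier
c = df[["col1"]]          # DataFrame with one column
d = df[["col1", "col2"]]  # DataFrame with two columns

for result in (a, b, c, d):
    print(type(result).__name__, result.shape)
```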

Exercise: return the city and team columns in various ways

In [92]:
# Write your code here


Use the unique method to get the distinct values of a column

In [93]:
fb.city.unique()

array(['Manchester', 'Liverpool', 'London'], dtype=object)
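
Two related methods are often useful alongside unique: nunique counts the distinct values and value_counts counts the rows per value. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical city column
city = pd.Series(["London", "Leeds", "London"], name="city")

print(city.unique())        # the distinct values
print(city.nunique())       # how many distinct values there are
print(city.value_counts())  # number of rows for each value
```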

### Sort rows
Order the rows in the dataframe by the number of games won (low to high), then in the case of any ties, by the number of games drawn.

In [94]:
fb.sort_values(by = ['won', 'drawn'], ascending=True)

id,city,team,champions_league,won,drawn,lost,form
ARL,London,Arsenal,Yes,5,0,2,WWWWW
CHE,London,Chelsea,No,5,2,0,WWWDD
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD
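
The ascending argument also accepts a list, one entry per sort column, so each column can have its own direction. The sketch below uses a small made-up table (the labels A to D are placeholders) and also shows that sort_values returns a new dataframe rather than changing the original.

```python
import pandas as pd

# Hypothetical results table
df = pd.DataFrame(
    {"won": [5, 6, 6, 5], "drawn": [4, 1, 0, 2]},
    index=["A", "B", "C", "D"],
)

# won from high to low, then drawn from low to high to break ties
ranked = df.sort_values(by=["won", "drawn"], ascending=[False, True])

print(ranked)
print(list(df.index))  # the original dataframe keeps its order
```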


Exercise: sort rows by the form column

In [95]:
# Write your code here

### Create new columns

In [96]:
# Teams get 3 points for a win, 1 for a draw, none for a loss
fb['points'] = fb['won'] * 3 + fb['drawn']
fb

id,city,team,champions_league,won,drawn,lost,form,points
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW,19
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW,18
CHE,London,Chelsea,No,5,2,0,WWWDD,17
ARL,London,Arsenal,Yes,5,0,2,WWWWW,15


**Exercise:** Create a new column, played, to hold the number of games played by each team.


In [97]:
# Write your code here

Add a new column team_caps. Note the way this is done: the argument passed to the apply method is the string method str.upper, which apply then calls on each value in the column.

In [98]:
fb['team_caps'] = fb['team'].apply(str.upper)
fb

id,city,team,champions_league,won,drawn,lost,form,points,team_caps
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW,19,MANCHESTER CITY
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19,LIVERPOOL
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW,18,TOTTENHAM HOTSPUR
CHE,London,Chelsea,No,5,2,0,WWWDD,17,CHELSEA
ARL,London,Arsenal,Yes,5,0,2,WWWWW,15,ARSENAL
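
The same result can be reached in more than one way. The sketch below compares passing an existing function to apply, passing a lambda function, and using the vectorised .str accessor, on a made-up Series `teams`.

```python
import pandas as pd

# Hypothetical team names
teams = pd.Series(["Chelsea", "Arsenal"], name="team")

upper_apply = teams.apply(str.upper)             # pass an existing function
upper_lambda = teams.apply(lambda t: t.upper())  # pass an anonymous (lambda) function
upper_str = teams.str.upper()                    # vectorised string accessor

print(upper_apply.tolist())
print(upper_lambda.tolist())
print(upper_str.tolist())
```

The .str accessor is usually the most readable choice for simple string operations; apply with a lambda is more general because the function can contain any Python expression.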


The next two statements have the same effect. They both create a column with the first five characters of the team name. Which do you prefer?

In [99]:
# Use slicing
fb['team_short'] = fb['team'].str[:5]
fb

id,city,team,champions_league,won,drawn,lost,form,points,team_caps,team_short
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW,19,MANCHESTER CITY,Manch
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19,LIVERPOOL,Liver
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW,18,TOTTENHAM HOTSPUR,Totte
CHE,London,Chelsea,No,5,2,0,WWWDD,17,CHELSEA,Chels
ARL,London,Arsenal,Yes,5,0,2,WWWWW,15,ARSENAL,Arsen


In [100]:
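# Use a list comprehension. It must produce exactly one value per row, so it has
# no filtering condition: skipping a short name would make the lengths mismatch.
fb['team_short2'] = [x[0:5] for x in fb['team']]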
fb

id,city,team,champions_league,won,drawn,lost,form,points,team_caps,team_short,team_short2
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW,19,MANCHESTER CITY,Manch,Manch
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19,LIVERPOOL,Liver,Liver
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW,18,TOTTENHAM HOTSPUR,Totte,Totte
CHE,London,Chelsea,No,5,2,0,WWWDD,17,CHELSEA,Chels,Chels
ARL,London,Arsenal,Yes,5,0,2,WWWWW,15,ARSENAL,Arsen,Arsen


In [101]:
# To keep the dataframe tidy, drop the new columns we added
fb = fb.drop(['team_caps', 'team_short', 'team_short2'], axis=1)
fb

id,city,team,champions_league,won,drawn,lost,form,points
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW,19
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW,18
CHE,London,Chelsea,No,5,2,0,WWWDD,17
ARL,London,Arsenal,Yes,5,0,2,WWWWW,15


### Grouping

Group by city

In [102]:
fb_by_city = fb.groupby(['city'], as_index = False)
fb_by_city.groups

{'Liverpool': ['LIV'], 'London': ['TOT', 'CHE', 'ARL'], 'Manchester': ['MCY']}

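Get the totals of the numeric columns per city. Only the result of the last expression in a cell is displayed, so the output below shows just the points and won totals.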

In [103]:
fb_by_city.sum(numeric_only=True)
fb_by_city[['points', 'won']].sum()


,city,points,won
0,Liverpool,19,6
1,London,50,16
2,Manchester,19,5
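
When different columns need different aggregations, the agg method with named aggregation is one option. This is a sketch on made-up data; the output column names `total_points`, `mean_won` and `teams` are chosen for the example.

```python
import pandas as pd

# Hypothetical table with one row per team
df = pd.DataFrame({
    "city": ["London", "London", "Leeds"],
    "points": [18, 17, 10],
    "won": [6, 5, 3],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby("city", as_index=False).agg(
    total_points=("points", "sum"),
    mean_won=("won", "mean"),
    teams=("points", "count"),
)
print(summary)
```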


Exercise: group the teams by whether or not they are in the Champions League (champions_league = 'Yes' or 'No').
Sum the won, drawn and lost columns.

In [104]:
# Write your code here

### Filter rows
Filter both the rows (using the index) and the columns of the dataframe in various ways using the loc indexer.

In [105]:
fb.loc['LIV', :] # one row, all columns, returns a series

city                Liverpool
team                Liverpool
champions_league          Yes
won                         6
drawn                       1
lost                        0
form                    WWWWD
points                     19
Name: LIV, dtype: object

In [106]:
fb.loc[['LIV'], :] # one row, all columns, returns a dataframe

id,city,team,champions_league,won,drawn,lost,form,points
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19


In [107]:
fb.loc[['LIV', 'MCY'], :] #   two rows, all columns, returns a dataframe

id,city,team,champions_league,won,drawn,lost,form,points
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19
MCY,Manchester,Manchester City,Yes,5,4,0,DWWWW,19


In [108]:
fb.loc[['LIV', 'MCY'], ['team', 'won']] # two rows, two columns

id,team,won
LIV,Liverpool,6
MCY,Manchester City,5


In [109]:
fb.loc[:, ['city']] # all rows, one column

id,city
MCY,Manchester
LIV,Liverpool
TOT,London
CHE,London
ARL,London


In [110]:
fb.loc[:, ['team', 'won']] # all rows, two columns

id,team,won
MCY,Manchester City,5
LIV,Liverpool,6
TOT,Tottenham Hotspur,6
CHE,Chelsea,5
ARL,Arsenal,5


### Filter rows by values

A boolean expression on a column returns a Series of bools, one per row

In [111]:
fb.won > 5


id
MCY    False
LIV     True
TOT     True
CHE    False
ARL    False
Name: won, dtype: bool

We can use this boolean expression to filter the rows of the dataframe where the bools are True

In [112]:
fb.loc[fb.won > 5, :]

id,city,team,champions_league,won,drawn,lost,form,points
LIV,Liverpool,Liverpool,Yes,6,1,0,WWWWD,19
TOT,London,Tottenham Hotspur,No,6,0,2,LLWWW,18


Exercise: filter the fb dataframe to return only London teams

In [113]:
# Write your code here


We can combine two filter criteria with an & (and)

In [114]:
(fb.city == 'London') & (fb.won > 5)

id
MCY    False
LIV    False
TOT     True
CHE    False
ARL    False
dtype: bool
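
Criteria can also be combined with | (or) and negated with ~ (not). Each comparison needs its own parentheses because &, | and ~ bind more tightly than == and >. The sketch below uses made-up data so as not to give away the exercise that follows.

```python
import pandas as pd

# Hypothetical table indexed by a short team code
df = pd.DataFrame(
    {"city": ["London", "Leeds", "London", "Leeds"], "won": [6, 7, 4, 3]},
    index=["AAA", "BBB", "CCC", "DDD"],
)

either = df.loc[(df.city == "London") | (df.won > 5), :]  # London teams or more than 5 wins
not_london = df.loc[~(df.city == "London"), :]            # every team not in London

print(list(either.index))
print(list(not_london.index))
```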

Exercise:  filter the fb dataframe to return only London teams that have won more than 5 games

In [115]:
# Write your code here