# pandas lesson 

## Introduction

pandas is *the* library for data analysis in Python. It has two main data structures:
* the Series for 1D labelled data such as a single row or column,
* the DataFrame for 2D data such as a table. 

A good place to get started with pandas is at https://pandas.pydata.org/getting_started.html

This lesson shows examples of typical operations on a pandas DataFrame, including how to:
* select a subset of columns
* calculate new columns
* filter rows in various ways
* group and summarise
* use the apply function and lambda functions


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # pandas uses matplotlib for plotting

## pandas Series (a 1D labelled array)

A Series is a 1D labelled array. By default the labels are position-based integers, starting at 0. Labels don't need to be unique. The elements of a Series usually share a single type, which is what we should expect given that a Series often becomes a column in a DataFrame (table). Common types are numbers (ints and floats) and strings (shown in the outputs below with dtype object).
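As a small illustration of the points above, the sketch below builds two throwaway Series (the values and the labels 'a' and 'b' are made up for this example) to show the default integer labels, the shared dtype, and what happens when a label is repeated.

```python
import pandas as pd

# Default labels are the integers 0, 1, 2, ...
s = pd.Series([1.5, 2.5, 3.5])
print(list(s.index))  # [0, 1, 2]
print(s.dtype)        # float64

# Labels need not be unique: selecting a repeated label returns every match
dup = pd.Series([10, 20, 30], index=['a', 'b', 'a'])
print(dup.loc['a'].tolist())  # [10, 30]
```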

We can create a Series in many ways, for example from a list.

### Create a Series

In [2]:
# Create a Series from a list of first names
first_names = pd.Series(['Harry', 'Hermione', 'Ron'])
first_names

0       Harry
1    Hermione
2         Ron
dtype: object

In [5]:
#  We can pass in an index when creating a Series
first_names = pd.Series(
    ['Harry', 'Hermione', 'Ron'], 
    index=['hp', 'hg', 'rw'])
first_names

hp       Harry
hg    Hermione
rw         Ron
dtype: object

In [6]:
# Create a Series of 4 random numbers drawn from a normal distribution with mean 10 and standard deviation 1.
prices = pd.Series(np.random.randn(4), name = 'price') + 10 
prices

0     8.357914
1     9.520414
2    11.296184
3    10.392348
Name: price, dtype: float64

We can also change the index after the Series has been created. This is useful, for example, in time series, where a date may be the index.

In [7]:
prices.index = ["HSBC", "BP", "TSCO", "RDSA"] 
prices

HSBC     8.357914
BP       9.520414
TSCO    11.296184
RDSA    10.392348
Name: price, dtype: float64

Exercise: examine the prices and first_names Series.  
You may want to try these properties, where p stands for the Series: p.index, p.values, p.dtype, p.shape, p.ndim, p.size.

In [11]:
# Write your code here as a set of print statements. The first one is provided.
print("Index:", prices.index) 

Index: Index(['HSBC', 'BP', 'TSCO', 'RDSA'], dtype='object')


### Access elements in a Series

We can access elements of the Series
* by position using the iloc property, or
* by their index label using the loc property.

In [12]:
# returns the item in the 2nd position
first_names.iloc[1]

'Hermione'

In [14]:
# returns the item in the 2nd and 3rd positions
first_names.iloc[1:3] 

hg    Hermione
rw         Ron
dtype: object

In [15]:
# returns the element with the index label 'rw'
first_names.loc['rw']

'Ron'

Note that slicing with loc includes the end label, unlike the usual Python slicing syntax!

In [16]:
first_names.loc['hp':'rw']

hp       Harry
hg    Hermione
rw         Ron
dtype: object
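To make the difference concrete, here is a minimal side-by-side sketch. It rebuilds the first_names data under the hypothetical name names so that it runs on its own.

```python
import pandas as pd

names = pd.Series(['Harry', 'Hermione', 'Ron'], index=['hp', 'hg', 'rw'])

# iloc slices by position: the stop position (2) is excluded
by_position = names.iloc[0:2]

# loc slices by label: the stop label ('hg') is included
by_label = names.loc['hp':'hg']

print(by_position.tolist())  # ['Harry', 'Hermione']
print(by_label.tolist())     # ['Harry', 'Hermione']
```

Both slices return the same two elements, even though the iloc slice stops one position past the last element returned and the loc slice stops on it.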

We can use the index to set values in the Series.

In [18]:
print("before change:", prices.loc['BP']) 
prices['BP'] = 5.0 # set a value in the Series
print("after change:", prices.loc['BP']) 


before change: 9.520414055964023
after change: 5.0


We can use *in* to check whether a label is in the index of the Series.

In [19]:
'TSCO' in prices # check if a label is in the index of the Series

True
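Because *in* looks at the index labels rather than the values, it can give a surprising answer when we ask about a value. The sketch below uses a small made-up prices Series to show this, and one possible way to test the values instead (via the values property).

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([8.0, 5.0], index=['HSBC', 'BP'])

label_found = 'BP' in prices           # 'BP' is an index label
value_as_label = 5.0 in prices         # 5.0 is a value, not a label
value_found = 5.0 in prices.values     # test the underlying values instead

print(label_found, value_as_label, value_found)  # True False True
```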

### Element-wise operations

An element-wise operation is one that is performed on every element of the Series. For example, we can multiply all values by 100.

In [20]:
prices * 100

HSBC     835.791396
BP       500.000000
TSCO    1129.618372
RDSA    1039.234816
Name: price, dtype: float64

Exercise: add 10 to each value in the prices Series.

In [None]:
# Write your code here

We can aggregate (e.g. sum, average) the values in a Series using either
* a NumPy function, e.g. np.sum(prices), or
* a method on the Series, e.g. prices.sum()

In [21]:
np.sum(prices) # total value of all elements

np.float64(35.04644584323994)

In [22]:
prices.sum() 

np.float64(35.04644584323994)

Exercise: Find the min, max, average, median and other summary statistics of the prices Series

In [25]:
# Write your code here as a set of print statements. The first one is provided.
print(f"minimum: numpy method {np.min(prices)}, Series method {prices.min()}")



minimum: numpy method 5.0, Series method 5.0


## pandas DataFrames

A DataFrame is basically a table. It is a 2D labelled data structure whose columns can have different types. You can think of it as a dict of Series objects (columns) if that helps. Like a Series, a DataFrame has an index.
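The "dict of Series" picture can be sketched directly. The example below is illustrative only: it builds a tiny two-column table from two Series that share an index (the variable names team, won and df are chosen for this sketch).

```python
import pandas as pd

# Two Series with the same index labels
team = pd.Series(['Manchester City', 'Liverpool'], index=['MCY', 'LIV'])
won = pd.Series([5, 6], index=['MCY', 'LIV'])

# The dict keys become column names; the shared labels become the index
df = pd.DataFrame({'team': team, 'won': won})

print(type(df['won']).__name__)  # Series: each column is a Series
print(list(df.index))            # ['MCY', 'LIV']
print(df.dtypes)                 # each column keeps its own type
```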

We can build a pandas Dataframe in many ways, for example from a dict. The dict's keys become the column names and the dict's values become the column values.

In [26]:
fb_dict = {
    'id': ['MCY', 'LIV', 'TOT', 'CHE', 'ARL'],
    'city': ['Manchester', 'Liverpool', 'London', 'London', 'London'],
    'team': ['Manchester City', 'Liverpool', 'Tottenham Hotspur', 'Chelsea', 'Arsenal'],
    'champions_league': ['Yes', 'Yes', 'No', 'No', 'Yes'],
    'won': [5, 6, 6, 5, 5],
    'drawn': [4, 1, 0, 2, 0],
    'lost': [0, 0, 2, 0, 2],
    'form': ['DWWWW', 'WWWWD', 'LLWWW', 'WWWDD', 'WWWWW']
}

fb = pd.DataFrame(fb_dict)

# set the index to the unique values of the 'id' column - more useful than 0,1,2...
fb = fb.set_index('id')
fb

| id | city | team | champions_league | won | drawn | lost | form |
|---|---|---|---|---|---|---|---|
| MCY | Manchester | Manchester City | Yes | 5 | 4 | 0 | DWWWW |
| LIV | Liverpool | Liverpool | Yes | 6 | 1 | 0 | WWWWD |
| TOT | London | Tottenham Hotspur | No | 6 | 0 | 2 | LLWWW |
| CHE | London | Chelsea | No | 5 | 2 | 0 | WWWDD |
| ARL | London | Arsenal | Yes | 5 | 0 | 2 | WWWWW |


Exercise: Examine the dataframe.  
Use the following dataframe properties and methods: index, head(), describe(), shape, values, columns.

In [27]:
# Write your code here as a set of print statements. The first one is provided.
print("Index:", fb.index)

Index: Index(['MCY', 'LIV', 'TOT', 'CHE', 'ARL'], dtype='object', name='id')


Typically, we create a dataframe by using a pandas method to load the data from a source.  In the example below, we load data from a CSV file on a public URL.

In [28]:
csv_file_url = "https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/Churn.csv"
churn = pd.read_csv(csv_file_url)
churn.head()

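|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |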
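read_csv also accepts a local file path or a file-like object, which is handy for trying things out without a network connection. The sketch below reads a few made-up rows from an in-memory string; the data and the choice of CustomerId as the index are assumptions for illustration, not part of the Churn dataset's documented structure.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk or at a URL
csv_text = """CustomerId,Geography,Age,Exited
1,France,42,1
2,Spain,41,0
"""

# index_col promotes a column to the index while loading
sample = pd.read_csv(io.StringIO(csv_text), index_col='CustomerId')

print(sample.shape)          # (2, 3)
print(list(sample.columns))  # ['Geography', 'Age', 'Exited']
```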


Exercise: examine the churn Dataframe

In [None]:
#  Write your code here

### Accessing columns in a Dataframe

We can access a column or columns (col1, col2) in a dataframe df in several ways:
* df['col1'] which returns a series
* df.col1 which returns a series (pandas allows us to refer to column names like a property!)
* df[['col1']] or df[['col1', 'col2']] which return dataframes
* df.loc[ : , ['col1', 'col2']] which is possibly best practice and most flexible
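The return types listed above can be checked on a throwaway dataframe. This is a minimal sketch with hypothetical columns col1 and col2:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})

as_series = df['col1']                 # single brackets: a Series
as_attribute = df.col1                 # dot notation: the same Series
as_frame = df[['col1']]                # double brackets: a one-column DataFrame
with_loc = df.loc[:, ['col1', 'col2']]  # all rows, listed columns: a DataFrame

for result in (as_series, as_attribute, as_frame, with_loc):
    print(type(result).__name__)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, which is one reason the bracket and loc forms are more general.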

Exercise: return the city and team columns in various ways

In [30]:
# Write your code here
print("Using dot notation:\n", fb.city)

Using dot notation:
 id
MCY    Manchester
LIV     Liverpool
TOT        London
CHE        London
ARL        London
Name: city, dtype: object


Use the unique method to get the distinct values of a column

In [31]:
fb.city.unique()

array(['Manchester', 'Liverpool', 'London'], dtype=object)

### Sort rows
Order the rows in the dataframe by the number of games won (low to high), then in the case of any ties, by the number of games drawn.

In [32]:
fb.sort_values(by = ['won', 'drawn'], ascending=True)

| id | city | team | champions_league | won | drawn | lost | form |
|---|---|---|---|---|---|---|---|
| ARL | London | Arsenal | Yes | 5 | 0 | 2 | WWWWW |
| CHE | London | Chelsea | No | 5 | 2 | 0 | WWWDD |
| MCY | Manchester | Manchester City | Yes | 5 | 4 | 0 | DWWWW |
| TOT | London | Tottenham Hotspur | No | 6 | 0 | 2 | LLWWW |
| LIV | Liverpool | Liverpool | Yes | 6 | 1 | 0 | WWWWD |
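The ascending argument can also be a list with one entry per sort column, so each column gets its own direction. The sketch below uses a cut-down copy of the football data (four teams, two columns) to sort by games won from high to low, breaking ties by games drawn from low to high.

```python
import pandas as pd

df = pd.DataFrame(
    {'won': [5, 6, 6, 5], 'drawn': [4, 1, 0, 2]},
    index=['MCY', 'LIV', 'TOT', 'CHE'])

# won: descending, drawn: ascending
ranked = df.sort_values(by=['won', 'drawn'], ascending=[False, True])

print(list(ranked.index))  # ['TOT', 'LIV', 'CHE', 'MCY']
```

Note that sort_values returns a new dataframe; the original is left unchanged unless the result is assigned back.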


Exercise: sort rows by the form column

In [None]:
# Write your code here

### Create new columns

In [33]:
# Teams get 3 points for a win, 1 for a draw, none for a loss
fb['points'] = fb['won'] * 3 + fb['drawn']
fb

| id | city | team | champions_league | won | drawn | lost | form | points |
|---|---|---|---|---|---|---|---|---|
| MCY | Manchester | Manchester City | Yes | 5 | 4 | 0 | DWWWW | 19 |
| LIV | Liverpool | Liverpool | Yes | 6 | 1 | 0 | WWWWD | 19 |
| TOT | London | Tottenham Hotspur | No | 6 | 0 | 2 | LLWWW | 18 |
| CHE | London | Chelsea | No | 5 | 2 | 0 | WWWDD | 17 |
| ARL | London | Arsenal | Yes | 5 | 0 | 2 | WWWWW | 15 |


**Exercise:** Create a new column, played, to hold the number of games played by each team.


In [None]:
# Write your code here

Add a new column, team_caps. Note the way this is done: we pass the string method str.upper to apply, which calls it on every element of the column.

In [None]:
fb['team_caps'] = fb['team'].apply(str.upper)
fb
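apply accepts any callable that takes one element, so a lambda function works just as well as a named function. The sketch below, on a two-team Series made up for this example, shows three ways that should give the same upper-case result: a named function, a lambda, and the vectorised str accessor.

```python
import pandas as pd

team = pd.Series(['Chelsea', 'Arsenal'])

upper_named = team.apply(str.upper)                     # pass an existing function
upper_lambda = team.apply(lambda name: name.upper())    # pass a lambda function
upper_accessor = team.str.upper()                       # use the .str accessor

print(upper_lambda.tolist())  # ['CHELSEA', 'ARSENAL']
```

A lambda is most useful when the transformation has no ready-made function, for example `lambda name: name.upper() + '!'`.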

The next two statements have the same effect.  They both create a column with the first five characters of the team.  Which do you prefer?

In [None]:
# Use slicing
fb['team_short'] = fb['team'].str[:5]
fb

In [None]:
# Use a list comprehension (no filter: it must yield one value per row, and slicing is safe on short strings)
fb['team_short2'] = [x[0:5] for x in fb['team']]
fb

In [None]:
# To keep the dataframe tidy, drop the new columns we added
fb = fb.drop(['team_caps', 'team_short', 'team_short2'], axis=1)
fb

### Grouping

Group the rows by city. The groups property maps each group to the index labels of its rows.

In [None]:
fb_by_city = fb.groupby(['city'], as_index = False)
fb_by_city.groups

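Get the totals per city, first of all the numeric columns and then of selected columns only. Note that a notebook displays only the result of the last expression in a cell.

In [None]:
fb_by_city.sum(numeric_only=True)    # totals of every numeric column (not displayed)
fb_by_city[['points', 'won']].sum()  # totals of the selected columns (displayed)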
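The as_index argument controls where the grouping key ends up in the result. The sketch below, on a three-row stand-in for the football data, compares the two settings.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Manchester', 'London', 'London'],
    'won': [5, 6, 5],
    'points': [19, 18, 17],
})

# as_index=False keeps 'city' as an ordinary column
key_as_column = df.groupby('city', as_index=False)[['won', 'points']].sum()

# the default (as_index=True) moves 'city' into the index
key_as_index = df.groupby('city')[['won', 'points']].sum()

print(list(key_as_column.columns))  # ['city', 'won', 'points']
print(list(key_as_index.columns))   # ['won', 'points']
print(list(key_as_index.index))     # ['London', 'Manchester']
```

Keeping the key as a column is often more convenient when the result is going to be merged, plotted or exported afterwards.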


Exercise: group the teams by whether they are in the Champions League (champions_league = 'Yes' or 'No'), then sum the won, drawn and lost columns.

In [None]:
# Write your code here

### Filter rows
Select rows by index label, and columns by name, in various ways using the loc property.

In [None]:
fb.loc['LIV', :] # one row, all columns, returns a series

In [None]:
fb.loc[['LIV'], :] # one row, all columns, returns a dataframe

In [None]:
fb.loc[['LIV', 'MCY'], :] #   two rows, all columns, returns a dataframe

In [None]:
fb.loc[['LIV', 'MCY'], ['team', 'won']] # two rows, two columns

In [None]:
fb.loc[:, ['city']] # all rows, one column

In [None]:
fb.loc[:, ['team', 'won']] # all rows, two columns

### Filter Rows by Values

A comparison applied to a column is an element-wise operation, so it returns a Series of bools, one per row.

In [None]:
fb.won > 5


We can use this boolean expression to filter the rows of the dataframe where the bools are True

In [None]:
fb.loc[fb.won > 5, :]

Exercise: filter the fb dataframe to return only London teams

In [None]:
# Write your code here


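We can combine two filter criteria with an & (and). Each criterion needs its own parentheses, because & binds more tightly than the comparison operators.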

In [None]:
(fb.city == 'London') & (fb.won > 5)
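The same pattern extends to | (or) and ~ (not), which are not used elsewhere in this lesson. The sketch below applies them to a three-row stand-in for the football data.

```python
import pandas as pd

df = pd.DataFrame(
    {'city': ['Manchester', 'London', 'London'], 'won': [5, 6, 5]},
    index=['MCY', 'TOT', 'CHE'])

# | keeps rows where at least one criterion is True
either = df.loc[(df.city == 'London') | (df.won > 5), :]

# ~ inverts a boolean Series
not_london = df.loc[~(df.city == 'London'), :]

print(list(either.index))      # ['TOT', 'CHE']
print(list(not_london.index))  # ['MCY']
```

The Python keywords and, or and not do not work here, because they expect a single True or False rather than a Series of bools.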

Exercise:  filter the fb dataframe to return only London teams that have won more than 5 games

In [None]:
# Write your code here