## Data Science Bootcamp or Data Analyst Apprenticeship - see Slack Channel

### Pandas ("panel data")

The Pandas package is used to explore data, e.g.,
 - To calculate statistical values (mean, max, min etc.)
 - To clean data (remove outliers, account for missing data etc.)
 - To restructure data into a nicer format
 - To prepare data for further processing, e.g., for classification, clustering, visualization etc.

Its data manipulation syntax resembles that of NumPy array operations. The Pandas package is already installed as part of the Anaconda installation. If you are not using Anaconda, you first need to install Pandas on your computer (for example with *pip install pandas*) before moving on to the exercises.
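
As a rough illustration of the NumPy-like syntax, the sketch below applies the same element-wise arithmetic to a NumPy array and to a Pandas Series. The numbers and variable names are made up for this example.

```python
import numpy as np
import pandas as pd

values = [12, 8, 10, 11]          # made-up numbers for illustration

arr = np.array(values)
ser = pd.Series(values)

# Arithmetic is applied element by element in both cases
arr_doubled = arr * 2
ser_doubled = ser * 2

# Summary statistics are available as methods on both objects
print(arr.mean(), ser.mean())
print(arr_doubled.tolist(), ser_doubled.tolist())
```

The main practical difference is that the Series also carries an index (row labels), which the following sections make use of.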

### What are pandas - series and dataframes

Below is a short example of a 'Series' and a 'DataFrame'. We won't worry too much about creating these from scratch since, typically, you will use Pandas to read in data from a given file.

#### Pandas Series
A series is effectively a column of data.

In [1]:
import pandas as pd   # use the Pandas package and refer to it as 'pd' to save writing out

data_Series = [12, 8, 10, 11]

attendance = pd.Series(data_Series)

print(attendance)

0    12
1     8
2    10
3    11
dtype: int64


We could make the labelling a bit more helpful.

In [2]:
# Add some row labels

attendance_labelled = pd.Series(data_Series, index = ['week_1', 'week_2', 'week_3', 'week_4'])

print(attendance_labelled)

week_1    12
week_2     8
week_3    10
week_4    11
dtype: int64
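
Once a Series has labels, individual values can be looked up by label or by position. The sketch below rebuilds the labelled Series from the cell above and shows both styles of access; the variable names *week_2*, *first* and *subset* are only used for this illustration.

```python
import pandas as pd

attendance_labelled = pd.Series(
    [12, 8, 10, 11],
    index=['week_1', 'week_2', 'week_3', 'week_4'],
)

# Look up a single value by its label
week_2 = attendance_labelled.loc['week_2']

# Look up a single value by its position (0-based)
first = attendance_labelled.iloc[0]

# Select several labels at once; the result is again a Series
subset = attendance_labelled.loc[['week_1', 'week_4']]

print(week_2, first)
print(subset)
```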


#### Pandas DataFrame

A DataFrame is simply a collection of multiple Series.

In [3]:
import pandas as pd   

data_df = {'Python' : [12, 8, 10, 11], 'Web101' : [6, 10, 8, 11], 'AI' : [11, 7, 6, 9]}

profiles = pd.DataFrame(data_df, index = ['week_1', 'week_2', 'week_3', 'week_4'])

print(profiles)

        Python  Web101  AI
week_1      12       6  11
week_2       8      10   7
week_3      10       8   6
week_4      11      11   9
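
To see the "collection of Series" idea in practice, the following sketch pulls a single column and a single row out of the same DataFrame. It reuses the data from the cell above; *python_col* and *week_3* are names chosen for this example.

```python
import pandas as pd

data_df = {'Python': [12, 8, 10, 11], 'Web101': [6, 10, 8, 11], 'AI': [11, 7, 6, 9]}
profiles = pd.DataFrame(data_df, index=['week_1', 'week_2', 'week_3', 'week_4'])

# Selecting one column returns a Series that shares the DataFrame's index
python_col = profiles['Python']
print(type(python_col))

# Selecting one row by label with .loc also returns a Series
week_3 = profiles.loc['week_3']
print(week_3)

# Column-wise statistics work on the whole DataFrame at once
print(profiles.mean())
```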


Note that in a notebook, if you don't use "print" and simply end the cell with the variable name, the DataFrame will be displayed as a formatted table.

In [4]:
profiles

Unnamed: 0,Python,Web101,AI
week_1,12,6,11
week_2,8,10,7
week_3,10,8,6
week_4,11,11,9


### Reading in data

We will use a local dataset (a CSV file) to give you a more realistic picture of what to expect when working on a typical data manipulation assignment, and also to promote interesting open datasets. This dataset was chosen because it is quite small, which makes it convenient for your first exercises with Pandas.
More interesting datasets can be found here: https://opendata.bristol.gov.uk/pages/homepage/

Let us first read the data from the file and see what type it is.

In [5]:
# Note that all the import statements should go at the top of a script
import os
import pandas as pd

input_file = os.path.join(os.path.abspath(''), 'datafiles', 'public-toilets-community.csv')

toilet_data = pd.read_csv(input_file)

print(type(toilet_data))

<class 'pandas.core.frame.DataFrame'>


Since the data is already of type DataFrame, we can easily print it as a table and check what information it contains.

In [6]:
toilet_data

Unnamed: 0,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
0,52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0,Yes,0.0,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
1,29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1,3,,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"
2,54,Avonmouth Medical Centre (Pioneer Medical Group),"1 Collins Street, Bristol",BS11 9JJ,Avonmouth & Lawrence Weston,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,0,0,2.0,...,N,Y during open times,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.694532865...","51.4987516075, -2.69453286599"
3,2081,Ridingleaze Medical Centre (Pioneer Medical Gr...,"Ridingleaze, Lawrence Weston",BS11 0QE,Avonmouth & Lawrence Weston,Monday- Friday 08:00 - 18:30,0,0,0,1.0,...,N,no,N,N,N,N,,,"{""type"": ""Point"", ""coordinates"": [-2.659817532...","51.5004455574, -2.65981753253"
4,57,Barton Hill Settlement,"41-43 Ducie Road, BS5 0AX",BS5 0AX,Lawrence Hill,Monday - Friday 9:00 - 16:00,2,,2,0.0,...,N,Yes- during open time,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.563674974...","51.4562494756, -2.5636749747"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,641,"Ambition Lawrence Weston, Lawrence Weston Yout...","Lawrence Weston Road, Long Cross, Bristol",BS11 0RX,Avonmouth & Lawrence Weston,Monday to Friday 09:00-16:00 with ad-hoc addit...,,,,,...,,,,,,,,,"{""type"": ""Point"", ""coordinates"": [-2.653822353...","51.5049424918, -2.65382235353"
103,1745,Helios - Sea Mills Surgery,2 Riverleaze,BS9 2HL,Stoke Bishop,Monday-Friday 08:00-18:30,0,0,0,1.0,...,Y,yes during opening hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.650086369...","51.4849119244, -2.65008636941"
104,2417,Merrywood Practice,"Knowle Health Park, Downton Road",BS4 1WH,Filwood,08:30-18:30 Monday-Friday,2,0,2,0.0,...,Y,yes,N,Y,N,N,,Y,"{""type"": ""Point"", ""coordinates"": [-2.596459550...","51.4280468206, -2.59645955096"
105,70,Morrison's - Hartcliffe,"Hareclive Rd, Petersons Avenue, Hartcliffe",BS14 0BE,Hartcliffe & Withywood,7am - 10pm,2,3,3,,...,,Y during open hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.599634281...","51.405200404, -2.59963428185"


In [7]:
toilet_df = pd.DataFrame(toilet_data)
toilet_df

Unnamed: 0,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
0,52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0,Yes,0.0,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
1,29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1,3,,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"
2,54,Avonmouth Medical Centre (Pioneer Medical Group),"1 Collins Street, Bristol",BS11 9JJ,Avonmouth & Lawrence Weston,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,0,0,2.0,...,N,Y during open times,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.694532865...","51.4987516075, -2.69453286599"
3,2081,Ridingleaze Medical Centre (Pioneer Medical Gr...,"Ridingleaze, Lawrence Weston",BS11 0QE,Avonmouth & Lawrence Weston,Monday- Friday 08:00 - 18:30,0,0,0,1.0,...,N,no,N,N,N,N,,,"{""type"": ""Point"", ""coordinates"": [-2.659817532...","51.5004455574, -2.65981753253"
4,57,Barton Hill Settlement,"41-43 Ducie Road, BS5 0AX",BS5 0AX,Lawrence Hill,Monday - Friday 9:00 - 16:00,2,,2,0.0,...,N,Yes- during open time,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.563674974...","51.4562494756, -2.5636749747"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,641,"Ambition Lawrence Weston, Lawrence Weston Yout...","Lawrence Weston Road, Long Cross, Bristol",BS11 0RX,Avonmouth & Lawrence Weston,Monday to Friday 09:00-16:00 with ad-hoc addit...,,,,,...,,,,,,,,,"{""type"": ""Point"", ""coordinates"": [-2.653822353...","51.5049424918, -2.65382235353"
103,1745,Helios - Sea Mills Surgery,2 Riverleaze,BS9 2HL,Stoke Bishop,Monday-Friday 08:00-18:30,0,0,0,1.0,...,Y,yes during opening hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.650086369...","51.4849119244, -2.65008636941"
104,2417,Merrywood Practice,"Knowle Health Park, Downton Road",BS4 1WH,Filwood,08:30-18:30 Monday-Friday,2,0,2,0.0,...,Y,yes,N,Y,N,N,,Y,"{""type"": ""Point"", ""coordinates"": [-2.596459550...","51.4280468206, -2.59645955096"
105,70,Morrison's - Hartcliffe,"Hareclive Rd, Petersons Avenue, Hartcliffe",BS14 0BE,Hartcliffe & Withywood,7am - 10pm,2,3,3,,...,,Y during open hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.599634281...","51.405200404, -2.59963428185"


Note that the printed table doesn't contain all rows, but only the first and last 5 out of a total of 107 rows.

Let's say we are only interested in quickly checking the first rows of the dataset. We can use x.head(n) to print the first *n* rows, where *x* is the DataFrame variable (by default, n=5).

In [8]:
toilet_data.head()

Unnamed: 0,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
0,52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0.0,Yes,0.0,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
1,29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1.0,3,,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"
2,54,Avonmouth Medical Centre (Pioneer Medical Group),"1 Collins Street, Bristol",BS11 9JJ,Avonmouth & Lawrence Weston,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,0.0,0,2.0,...,N,Y during open times,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.694532865...","51.4987516075, -2.69453286599"
3,2081,Ridingleaze Medical Centre (Pioneer Medical Gr...,"Ridingleaze, Lawrence Weston",BS11 0QE,Avonmouth & Lawrence Weston,Monday- Friday 08:00 - 18:30,0,0.0,0,1.0,...,N,no,N,N,N,N,,,"{""type"": ""Point"", ""coordinates"": [-2.659817532...","51.5004455574, -2.65981753253"
4,57,Barton Hill Settlement,"41-43 Ducie Road, BS5 0AX",BS5 0AX,Lawrence Hill,Monday - Friday 9:00 - 16:00,2,,2,0.0,...,N,Yes- during open time,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.563674974...","51.4562494756, -2.5636749747"


In [9]:
toilet_data.head(2)

Unnamed: 0,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
0,52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0,Yes,0.0,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
1,29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1,3,,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"


Similarly, let's say we are only interested in quickly checking the last rows of the dataset. We can use x.tail(n) to print the last *n* rows, where *x* is the DataFrame variable (by default, n=5).

In [10]:
toilet_data.tail()

Unnamed: 0,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
102,641,"Ambition Lawrence Weston, Lawrence Weston Yout...","Lawrence Weston Road, Long Cross, Bristol",BS11 0RX,Avonmouth & Lawrence Weston,Monday to Friday 09:00-16:00 with ad-hoc addit...,,,,,...,,,,,,,,,"{""type"": ""Point"", ""coordinates"": [-2.653822353...","51.5049424918, -2.65382235353"
103,1745,Helios - Sea Mills Surgery,2 Riverleaze,BS9 2HL,Stoke Bishop,Monday-Friday 08:00-18:30,0.0,0.0,0.0,1.0,...,Y,yes during opening hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.650086369...","51.4849119244, -2.65008636941"
104,2417,Merrywood Practice,"Knowle Health Park, Downton Road",BS4 1WH,Filwood,08:30-18:30 Monday-Friday,2.0,0.0,2.0,0.0,...,Y,yes,N,Y,N,N,,Y,"{""type"": ""Point"", ""coordinates"": [-2.596459550...","51.4280468206, -2.59645955096"
105,70,Morrison's - Hartcliffe,"Hareclive Rd, Petersons Avenue, Hartcliffe",BS14 0BE,Hartcliffe & Withywood,7am - 10pm,2.0,3.0,3.0,,...,,Y during open hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.599634281...","51.405200404, -2.59963428185"
106,5153,the Life Boat,"25 High St, Bristol",BS11 0DX,Avonmouth & Lawrence Weston,Monday - Saturday 10:00 - 23:00 Sunday 12:00 ...,1.0,1.0,1.0,,...,N,,N,N,N,,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.675368178...","51.4890035948, -2.67536817895"


In [11]:
toilet_data.tail(2)

Unnamed: 0,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
105,70,Morrison's - Hartcliffe,"Hareclive Rd, Petersons Avenue, Hartcliffe",BS14 0BE,Hartcliffe & Withywood,7am - 10pm,2,3,3,,...,,Y during open hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.599634281...","51.405200404, -2.59963428185"
106,5153,the Life Boat,"25 High St, Bristol",BS11 0DX,Avonmouth & Lawrence Weston,Monday - Saturday 10:00 - 23:00 Sunday 12:00 ...,1,1,1,,...,N,,N,N,N,,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.675368178...","51.4890035948, -2.67536817895"


From a first glance at the table, we notice an 'OBJECTID' column which, according to the dataset documentation, seems to be a unique identifier for each location. Hence, it makes sense to use it as our row index. However, we first need to check whether there are any duplicates in the 'OBJECTID' column; if there are, it cannot identify rows uniquely and is a poor choice of index.

In [12]:
OBJECTID_data = toilet_data['OBJECTID']
OBJECTID_data

0        52
1        29
2        54
3      2081
4        57
       ... 
102     641
103    1745
104    2417
105      70
106    5153
Name: OBJECTID, Length: 107, dtype: int64

In [13]:
test = OBJECTID_data.duplicated().any() # returns True if any duplicates, False otherwise
print(test)

False


Great, we have no duplicates! 
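
The same check can also be written with the *is_unique* attribute of a Series. The sketch below uses two small, made-up lists of identifiers rather than the real file.

```python
import pandas as pd

ids = pd.Series([52, 29, 54, 2081, 57])        # made-up sample of identifiers
ids_with_repeat = pd.Series([52, 29, 52])      # 52 appears twice

print(ids.is_unique)
print(ids_with_repeat.is_unique)

# duplicated() marks every repeat after the first occurrence,
# so it can be used as a mask to show the offending values
print(ids_with_repeat[ids_with_repeat.duplicated()])
```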

Now we can read the DataFrame again, but using the first column as an index column.

In [14]:
toilet_data = pd.read_csv(input_file, index_col = [0])  # use first column as the index
toilet_data

Unnamed: 0_level_0,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),Accessible (no. of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
OBJECTID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0,Yes,0.0,1,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1,3,,Yes,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"
54,Avonmouth Medical Centre (Pioneer Medical Group),"1 Collins Street, Bristol",BS11 9JJ,Avonmouth & Lawrence Weston,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,0,0,2.0,1,...,N,Y during open times,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.694532865...","51.4987516075, -2.69453286599"
2081,Ridingleaze Medical Centre (Pioneer Medical Gr...,"Ridingleaze, Lawrence Weston",BS11 0QE,Avonmouth & Lawrence Weston,Monday- Friday 08:00 - 18:30,0,0,0,1.0,1,...,N,no,N,N,N,N,,,"{""type"": ""Point"", ""coordinates"": [-2.659817532...","51.5004455574, -2.65981753253"
57,Barton Hill Settlement,"41-43 Ducie Road, BS5 0AX",BS5 0AX,Lawrence Hill,Monday - Friday 9:00 - 16:00,2,,2,0.0,2,...,N,Yes- during open time,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.563674974...","51.4562494756, -2.5636749747"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
641,"Ambition Lawrence Weston, Lawrence Weston Yout...","Lawrence Weston Road, Long Cross, Bristol",BS11 0RX,Avonmouth & Lawrence Weston,Monday to Friday 09:00-16:00 with ad-hoc addit...,,,,,1,...,,,,,,,,,"{""type"": ""Point"", ""coordinates"": [-2.653822353...","51.5049424918, -2.65382235353"
1745,Helios - Sea Mills Surgery,2 Riverleaze,BS9 2HL,Stoke Bishop,Monday-Friday 08:00-18:30,0,0,0,1.0,1,...,Y,yes during opening hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.650086369...","51.4849119244, -2.65008636941"
2417,Merrywood Practice,"Knowle Health Park, Downton Road",BS4 1WH,Filwood,08:30-18:30 Monday-Friday,2,0,2,0.0,1,...,Y,yes,N,Y,N,N,,Y,"{""type"": ""Point"", ""coordinates"": [-2.596459550...","51.4280468206, -2.59645955096"
70,Morrison's - Hartcliffe,"Hareclive Rd, Petersons Avenue, Hartcliffe",BS14 0BE,Hartcliffe & Withywood,7am - 10pm,2,3,3,,1,...,,Y during open hours,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.599634281...","51.405200404, -2.59963428185"


That's all there is to it! The data is now in a DataFrame and ready to be examined. Similarly, you can easily read in data from a JSON file or an SQL database (for those who are familiar with them).
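
Re-reading the file is not the only way to get 'OBJECTID' into the index. As a sketch, an already loaded DataFrame can be re-indexed with *set_index*. The tiny in-memory CSV below stands in for the real file and contains only two rows and three columns copied from the table above.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file
csv_text = """OBJECTID,Name,Postcode
52,Tesco - Callington Road,BS4 5AY
29,Sea Walls Public Toilets,BS9 1ZZ
"""

df = pd.read_csv(io.StringIO(csv_text))

# Promote the column to the index; verify_integrity raises an error on duplicates
df_indexed = df.set_index('OBJECTID', verify_integrity=True)

# Rows can now be looked up by their OBJECTID
print(df_indexed.loc[29, 'Name'])
```

With *verify_integrity=True* the duplicate check and the re-indexing happen in one step, at the cost of an exception if the column turns out not to be unique.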

### Examining the data

Looking at the first few rows wasn't very helpful. There is too much information and it's not easy to see all of it. We can get a more compact overview by running x.info(), which lists every column together with its number of non-null values and its data type.

In [15]:
toilet_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 22 columns):
Name                               107 non-null object
Address                            107 non-null object
Postcode                           107 non-null object
Ward                               107 non-null object
Toilet opening hours               105 non-null object
Male (no. of)                      82 non-null object
Urinals (no. of)                   66 non-null object
Female (no. of)                    82 non-null object
Unisex (no.of)                     51 non-null float64
Accessible (no. of)                107 non-null object
Accessibility info                 103 non-null object
Changing Place                     90 non-null object
Automatic opening door             75 non-null object
Pull cord monitored (Y/N/Times)    71 non-null object
Radar Key                          80 non-null object
Baby change                        86 non-null object
Family Toilet        

Now we have a nice summary of the dataset properties.
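
If only one piece of that summary is needed, it can be requested on its own. The sketch below uses a three-row toy DataFrame (made-up values, column names borrowed from the dataset) to show the size, the column types and the non-null counts separately.

```python
import pandas as pd

df = pd.DataFrame({
    'Postcode': ['BS4 5AY', 'BS9 1ZZ', 'BS11 9JJ'],
    'Unisex (no.of)': [0.0, None, 2.0],      # None becomes NaN
})

print(df.shape)          # (number of rows, number of columns)
print(df.dtypes)         # data type of each column
print(df.count())        # number of non-null values per column
```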

### Cleaning and manipulating the data

Let's assume that we want to look at the spread of publicly available toilets by postcode, so that we can see which areas need further work and encourage businesses to join the scheme. We will want to know when the toilets in a certain postcode area are open and whether they include female and accessible toilets. So these fields need to be retained, while all the others can be removed. The steps we will follow are:
 - Remove columns that are clearly not necessary.
 - Have a look at data by postcode.
 - Save the cleaned-up data into a file so that all your hard work isn't lost!

First of all, let's see how to remove columns from the table that don't hold any useful information for us. For instance, the 'Ward' column can be removed with x.drop().

In [16]:
toilet_data_tmp = toilet_data.drop("Ward", axis = 1) # axis = 1 means we are dropping a column; use axis = 0 to drop rows
toilet_data_tmp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 21 columns):
Name                               107 non-null object
Address                            107 non-null object
Postcode                           107 non-null object
Toilet opening hours               105 non-null object
Male (no. of)                      82 non-null object
Urinals (no. of)                   66 non-null object
Female (no. of)                    82 non-null object
Unisex (no.of)                     51 non-null float64
Accessible (no. of)                107 non-null object
Accessibility info                 103 non-null object
Changing Place                     90 non-null object
Automatic opening door             75 non-null object
Pull cord monitored (Y/N/Times)    71 non-null object
Radar Key                          80 non-null object
Baby change                        86 non-null object
Family Toilet                      85 non-null object
Automatic Public Conve

Great! Now column "Ward" is no longer appearing in our new dataframe. However:
 - we'd really like to remove many columns at the same time
 - if we create a new dataframe every time we make a change, we will use lots of memory. It would be better to change the existing dataframe - as long as we are careful not to make mistakes when modifying it directly. 

In order to still have access to the original data in case we need to check something in it afterwards, let's use a separate variable for the cleaned-up data. In our case, we can use *toilet_data_tmp*.

**Note: if we want to overwrite the original data when removing columns, we need to use x.drop(..., inplace=True).** By default, inplace=False and the drop() function returns a copy of the input without the removed columns. With inplace=True, the function returns nothing (None) and modifies the input directly.
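
The difference between the two modes can be seen on a one-row toy DataFrame. This is only a sketch with made-up values; *smaller* and *result* are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'Postcode': ['BS4 5AY'], 'Ward': ['Brislington West'], 'Name': ['Tesco']})

# Default behaviour: a new DataFrame is returned and df is left untouched
smaller = df.drop('Ward', axis=1)
print(list(df.columns), list(smaller.columns))

# inplace=True: df itself is modified and the call returns None
result = df.drop('Ward', axis=1, inplace=True)
print(result, list(df.columns))
```

A common slip is writing *df = df.drop(..., inplace=True)*, which replaces the DataFrame with None.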

In [17]:
toilet_data_tmp.drop(["Changing Place", "Name","Address", "Automatic opening door", "Pull cord monitored (Y/N/Times)","Male (no. of)",
              "Urinals (no. of)","Radar Key","Unisex (no.of)","Accessibility info",
              "Baby change", "Family Toilet", "Automatic Public Convenience", "Attended", "Full Time Staffing", 
              "geo_shape", "geo_point_2d"], axis = 1, inplace = True)

In [18]:
# Let's look at the result
toilet_data_tmp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 4 columns):
Postcode                107 non-null object
Toilet opening hours    105 non-null object
Female (no. of)         82 non-null object
Accessible (no. of)     107 non-null object
dtypes: object(4)
memory usage: 4.2+ KB


We were quite drastic when removing columns. In real life, you might want to look a bit more carefully at some of the columns before deleting them. Also note that if you run the cell above a second time, or misspell a column name, drop() will raise an error because the column cannot be found.
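
The sketch below reproduces that error on a toy DataFrame and shows two ways of avoiding it: the *errors='ignore'* option of drop(), and selecting the columns to keep instead of listing the ones to remove. The misspelt name 'Wrad' is deliberate.

```python
import pandas as pd

df = pd.DataFrame({'Postcode': ['BS4 5AY'], 'Ward': ['Brislington West']})

# Dropping a column that does not exist raises a KeyError by default
try:
    df.drop('Wrad', axis=1)          # deliberate typo
except KeyError as err:
    print('KeyError:', err)

# errors='ignore' skips labels that are not found instead of raising
unchanged = df.drop('Wrad', axis=1, errors='ignore')

# Alternative: list the columns to keep rather than the ones to drop
kept = df[['Postcode']].copy()
print(list(unchanged.columns), list(kept.columns))
```

Note that *errors='ignore'* also hides genuine typos, so the keep-list style is often the safer choice when only a few columns are needed.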

### Save data to file

Now we will save the modified table to file.

In [19]:
output_file = os.path.join(os.path.abspath(''), 'datafiles', 'dataset_acc.csv')
toilet_data_tmp.to_csv(output_file)

Now let's open the file and check the data inside.

In [20]:
input_file = os.path.join(os.path.abspath(''), 'datafiles', 'dataset_acc.csv')

new_data = pd.read_csv(input_file)

new_data

Unnamed: 0,OBJECTID,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
0,52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,1
1,29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,Yes
2,54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1
3,2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1
4,57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2
...,...,...,...,...,...
102,641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,,1
103,1745,BS9 2HL,Monday-Friday 08:00-18:30,0,1
104,2417,BS4 1WH,08:30-18:30 Monday-Friday,2,1
105,70,BS14 0BE,7am - 10pm,3,1


Let us now set the 'OBJECTID' column (column 0) as the index column.

In [21]:
new_data = pd.read_csv(input_file, index_col = [0])

new_data

Unnamed: 0_level_0,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
OBJECTID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,1
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,Yes
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2
...,...,...,...,...
641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,,1
1745,BS9 2HL,Monday-Friday 08:00-18:30,0,1
2417,BS4 1WH,08:30-18:30 Monday-Friday,2,1
70,BS14 0BE,7am - 10pm,3,1
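
The save-and-reload round trip can be tried without touching the disk. In the sketch below an in-memory text buffer replaces the file, and the index column is referred to by name rather than by position; the two-row DataFrame is a cut-down stand-in for the real data.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'Postcode': ['BS4 5AY', 'BS9 1ZZ']},
    index=pd.Index([52, 29], name='OBJECTID'),
)

# to_csv writes the index as the first column, using its name as the header
buffer = io.StringIO()
df.to_csv(buffer)
print(buffer.getvalue())

# Reading it back: refer to the index column by name instead of position
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='OBJECTID')
print(restored)
```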


### Additional DataFrame operations

Let's have a look at some of the functions that are available in Pandas.

#### x.describe()

According to the Pandas documentation, x.describe() generates descriptive statistics, e.g., a dataset’s distribution (excluding NaN values). By default only numeric fields are returned.

In [22]:
new_data.describe()

Unnamed: 0,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
count,107,105,82,107
unique,100,98,13,8
top,BS5 6HL,"Monday-Friday 09:00-16:00, weekends and bank h...",2,1
freq,2,3,19,78


 - top is the most common value in the dataset 
 - freq is the frequency of the most common value

We would expect a few more characteristics to be shown in the output of the describe() function. For instance, the mean, min, max and 25th percentile that are reported for numeric columns are completely missing. The reason is visible in x.info(): all of the columns are of type 'object', which is a sign that they hold strings or mixed types, e.g., integers and strings. Since there are no numeric columns at all, describe() falls back to summarising the object columns with count, unique, top and freq.
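
The effect of the column type on describe() can be reproduced on a toy DataFrame that holds the same information once as text and once as integers. The column names *as_text* and *as_number* are invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'as_text': ['3', '0', 'Yes', '2'],     # mixed content, stored as strings
    'as_number': [3, 0, 1, 2],             # the same information as integers
})

# Only the numeric column is summarised by default
numeric_summary = df.describe()
print(numeric_summary)

# include='all' reports every column; statistics that do not apply show NaN
full_summary = df.describe(include='all')
print(full_summary)
```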

In [23]:
new_data.info() # as before

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 4 columns):
Postcode                107 non-null object
Toilet opening hours    105 non-null object
Female (no. of)         82 non-null object
Accessible (no. of)     107 non-null object
dtypes: object(4)
memory usage: 4.2+ KB


Let's have a closer look - first at the female column.

In [24]:
new_data['Female (no. of)'].unique() # gives the distinct values found in the column (aka Pandas Series)

array(['Yes', '3', '0', '2', '4', nan, '5', '1', '6', '19', '7', '9',
       'No', '24'], dtype=object)
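
If only the number of distinct values is of interest, nunique() can be used instead of unique(). A small sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

col = pd.Series(['Yes', '3', '0', '3', np.nan, 'No', '0'])

print(col.unique())                # the distinct values themselves, NaN included
print(col.nunique())               # how many distinct values, NaN excluded by default
print(col.nunique(dropna=False))   # count NaN as a value too
```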

We can also check the frequency with which each of these occur.

In [25]:
new_data['Female (no. of)'].value_counts() # gives the frequency of each value

2      19
0      15
3      14
1      13
5       5
4       5
6       3
9       2
7       2
19      1
24      1
No      1
Yes     1
Name: Female (no. of), dtype: int64

In [26]:
new_data.isna().sum() # gives the number of missing values (None or NaN) per column; blank cells in the CSV are read in as NaN

Postcode                 0
Toilet opening hours     2
Female (no. of)         25
Accessible (no. of)      0
dtype: int64

So, putting these together, we can deduce that for the 'female' column:
- there are 107 rows, of which 25 are either blank or NaN, and 82 have values
- the values are actually being held as strings (presumably why the .describe() function didn't return a wealth of information!)
- there is 1 value each for 'No' and 'Yes'

First, let's tackle the NaNs. We don't know why they are NaN as we don't know about the source of the data. 

Here, you have two main options:
- ignore completely the entries with NaNs
- overwrite them with a 'special' value, e.g., a very large integer value, which is larger than the current max or something else that makes sense

In this example, we will simply overwrite them with the integer 99.
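
Both options can be compared side by side on a three-row toy DataFrame. This is only a sketch with made-up rows; the column is created with *dtype=object* so that it can hold strings and the integer placeholder together, as in the real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Postcode': ['BS4 5AY', 'BS9 1ZZ', 'BS11 0RX'],
    'Female (no. of)': pd.Series(['Yes', '3', np.nan], dtype=object),
})

# Option 1: discard the rows where this column is missing
dropped = df.dropna(subset=['Female (no. of)'])

# Option 2: keep every row and substitute a placeholder value
filled = df.copy()
filled['Female (no. of)'] = filled['Female (no. of)'].fillna(99)

print(len(dropped), len(filled))
print(filled['Female (no. of)'].tolist())
```

Dropping rows loses the postcode information for those locations, while a placeholder keeps the rows but has to be remembered whenever statistics are computed.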

In [27]:
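# Fill all NaN & blanks with 99. The result is assigned back to the column, because
# calling fillna(..., inplace=True) on a selected column is deprecated in recent Pandas versions.
new_data['Female (no. of)'] = new_data['Female (no. of)'].fillna(99)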

new_data # Let's have a look and see if the NaN has gone

Unnamed: 0_level_0,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
OBJECTID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,1
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,Yes
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2
...,...,...,...,...
641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,99,1
1745,BS9 2HL,Monday-Friday 08:00-18:30,0,1
2417,BS4 1WH,08:30-18:30 Monday-Friday,2,1
70,BS14 0BE,7am - 10pm,3,1


In [28]:
new_data['Female (no. of)'].unique()

array(['Yes', '3', '0', '2', '4', 99, '5', '1', '6', '19', '7', '9', 'No',
       '24'], dtype=object)

**Note: can you see that 99 now appears as an integer and not a string? That is because we didn't put it between quotes.** If we wanted it to be a string, we should have used *new_data['Female (no. of)'] = new_data['Female (no. of)'].fillna('99')*

Ok, it looks promising.  Let's just double-check by looking at the number of NaNs and the frequency with which values appear.

In [29]:
new_data.isna().sum()

Postcode                0
Toilet opening hours    2
Female (no. of)         0
Accessible (no. of)     0
dtype: int64

In [30]:
new_data['Female (no. of)'].value_counts() 

99     25
2      19
0      15
3      14
1      13
5       5
4       5
6       3
9       2
7       2
19      1
24      1
No      1
Yes     1
Name: Female (no. of), dtype: int64

Now let's address the 'Yes' and 'No' and the fact that the apparent integers are actually strings - all in one step.

The x.map(XYZ) function applies whatever is defined by XYZ to every value in a Pandas series.  Let's use this to convert all 'Yes' and 'No's.  We won't replace the original data (*inplace=True*) this time as it's quite a drastic change to make and it might go wrong and spoil the existing DataFrame, so let's create a new column to capture the modifications.

In [31]:
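def yes_no(x):
    """Convert 'Yes'/'No' to 1/0 and numeric strings to integers."""
    if x == 'Yes':
        return 1
    if x == 'No':
        return 0
    if isinstance(x, str):
        return int(x)   # e.g. '3' becomes 3
    return x            # already a number (e.g. the 99 placeholder)

new_data['Female'] = new_data['Female (no. of)'].map(yes_no) # changes yes to 1, no to 0 and numeric strings to integers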

Now let's check by looking at the frequency.

In [32]:
new_data['Female'].value_counts()

99    25
2     19
0     16
3     14
1     14
5      5
4      5
6      3
9      2
7      2
24     1
19     1
Name: Female, dtype: int64
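
As an alternative to a hand-written function, the same conversion can be sketched with two built-in operations: replace() for the text answers and pd.to_numeric() for the numeric strings. The Series below is a made-up sample of the kinds of values found in the column.

```python
import pandas as pd

raw = pd.Series(['Yes', '3', '0', 99, 'No', '24'], dtype=object)

# Swap the text answers for numbers, then let Pandas parse everything else
converted = pd.to_numeric(raw.replace({'Yes': 1, 'No': 0}))

print(converted.tolist())
print(converted.dtype)
```

By default pd.to_numeric() raises an error on text it cannot parse; passing *errors='coerce'* turns such values into NaN instead.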

Looks good!  Now let's check the DataFrame and the .describe() function to see if the results are any different.

In [33]:
new_data

Unnamed: 0_level_0,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of),Female
OBJECTID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,1,1
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,Yes,3
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1,0
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1,0
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2,2
...,...,...,...,...,...
641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,99,1,99
1745,BS9 2HL,Monday-Friday 08:00-18:30,0,1,0
2417,BS4 1WH,08:30-18:30 Monday-Friday,2,1,2
70,BS14 0BE,7am - 10pm,3,1,3


In [34]:
new_data.describe()

Unnamed: 0,Female
count,107.0
mean,25.299065
std,41.009252
min,0.0
25%,1.0
50%,3.0
75%,14.0
max,99.0


This already looks much better, but don't forget that there are 25 values of 99, which is purely illustrative data!
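
How strongly a placeholder distorts the statistics can be seen on a small made-up Series. The sketch compares the mean with the placeholder in place, with the gaps left as NaN, and with the placeholder masked out again.

```python
import numpy as np
import pandas as pd

counts = pd.Series([1, 3, 0, 2, np.nan, np.nan])   # made-up counts with two gaps

with_sentinel = counts.fillna(99)

# The placeholder drags the mean upwards
print(with_sentinel.mean())

# Leaving the gaps as NaN: mean() skips them by default
print(counts.mean())

# Or mask the placeholder again before computing statistics
masked = with_sentinel.replace(99, np.nan)
print(masked.mean())
```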

Now we can safely remove the *Female (no. of)* column.

In [35]:
new_data.drop('Female (no. of)', axis = 1, inplace = True)

Let's do the same for the information on Accessible toilets.  We already know that there are no blanks, so we just need to see how many 'Yes' and 'No's there are.

In [36]:
new_data['Accessible (no. of)'].value_counts()

1      78
0      14
Yes     6
2       5
YES     1
4       1
3       1
5       1
Name: Accessible (no. of), dtype: int64

We will need to make a few adjustments to the previous map() function, because 'Yes' appears in different formats.

In [37]:
def yes(x):
    if isinstance(x, str) and x.lower() == 'yes':   # matches 'Yes', 'YES', 'yes' etc.
        x = 1
    elif isinstance(x, str):
        x = int(x)
    return x

new_data['Acc'] = new_data['Accessible (no. of)'].map(yes) # changes any spelling of 'yes' to 1 and numeric strings to integers

In [38]:
new_data['Acc'].value_counts()

1    85
0    14
2     5
5     1
4     1
3     1
Name: Acc, dtype: int64
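The same conversion can also be written without a custom function. The sketch below is an alternative on made-up values, not the approach used above: it lower-cases the strings, swaps 'yes' for '1' and then converts the whole Series to integers.

```python
import pandas as pd

toy = pd.Series(['1', 'Yes', 'YES', '0', 'yes', '2'])   # made-up values

# .str.lower() normalises the case, replace() swaps whole values,
# and astype(int) converts the numeric strings to integers
as_int = toy.str.lower().replace('yes', '1').astype(int)

print(as_int.tolist())   # [1, 1, 1, 0, 1, 2]
```

Note that astype(int) raises an error if any value is not a valid number, which is a useful safeguard against unexpected entries.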

Let's check the entire DataFrame again before deleting the redundant column.

In [39]:
new_data

OBJECTID,Postcode,Toilet opening hours,Accessible (no. of),Female,Acc
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",1,1,1
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,Yes,3,1
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",1,0,1
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,1,0,1
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2,2
...,...,...,...,...,...
641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,1,99,1
1745,BS9 2HL,Monday-Friday 08:00-18:30,1,0,1
2417,BS4 1WH,08:30-18:30 Monday-Friday,1,2,1
70,BS14 0BE,7am - 10pm,1,3,1


In [40]:
new_data.drop('Accessible (no. of)', axis = 1, inplace = True)

In [41]:
new_data.describe()

,Female,Acc
count,107.0,107.0
mean,25.299065,1.0
std,41.009252,0.672927
min,0.0,0.0
25%,1.0,1.0
50%,3.0,1.0
75%,14.0,1.0
max,99.0,5.0


### Grouping

Finally, let's look at a Pandas function which allows you to combine rows that share a common value and summarise each group. For example, a company might have a number of transactions for each customer, and they all need to be added together to look at the total sales distribution by customer.
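The customer example can be sketched as follows. The names and amounts are made up purely for illustration.

```python
import pandas as pd

# Made-up transactions: each row is one purchase
sales = pd.DataFrame({
    'customer': ['Ann', 'Bob', 'Ann', 'Cara', 'Bob'],
    'amount':   [10, 5, 20, 7, 15],
})

# Group the rows by customer and add up the amounts within each group
totals = sales.groupby('customer')['amount'].sum()

print(totals)
```

Each customer appears once in the result, with the sum of all of their transactions (Ann 30, Bob 20, Cara 7).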

In this case we want to see how the toilets are distributed around the city, so grouping by 'Postcode' seems sensible, though probably only by the first part of the postcode (the 3 or 4 characters before the space). Since that first part varies in length, it is easier to remove the last 3 characters of the postcode than to select the first 3 or 4. Remember slicing of strings?
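As a quick reminder of string slicing, here is a short sketch using one postcode from the data. The split() line is an alternative that is not used in this notebook.

```python
postcode = 'BS11 9JJ'

print(postcode[:-3])         # 'BS11 ' (the trailing space is still there)
print(postcode[:-4])         # 'BS11'
print(postcode.split()[0])   # 'BS11' (alternative: take the part before the space)
```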

In [42]:
def remove_last_three(string):
    # slice off 4 characters, not 3, so that the whitespace is removed as well
    return string[:-4]

new_data['postcode_short'] = new_data['Postcode'].map(remove_last_three)
new_data

OBJECTID,Postcode,Toilet opening hours,Female,Acc,postcode_short
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",1,1,BS4
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,1,BS9
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1,BS11
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1,BS11
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2,BS5
...,...,...,...,...,...
641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,99,1,BS11
1745,BS9 2HL,Monday-Friday 08:00-18:30,0,1,BS9
2417,BS4 1WH,08:30-18:30 Monday-Friday,2,1,BS4
70,BS14 0BE,7am - 10pm,3,1,BS14


In [43]:
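# select the numeric columns explicitly; newer pandas versions would otherwise also try to sum the text columns
grouped = new_data.groupby('postcode_short')[['Female', 'Acc']].sum()
grouped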

postcode_short,Female,Acc
BS1,673,30
BS10,115,6
BS11,201,5
BS13,106,4
BS14,111,4
BS15,99,0
BS16,106,5
BS2,7,1
BS3,220,12
BS4,221,10


The Female column doesn't look particularly sensible because we used 99 instead of blanks - you can improve on this in the exercises.

You might also want to carry out some sanity checks on the result to make sure that things like the count, sum and mean are consistent with the original data, i.e., that the code hasn't gone wrong somewhere!
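A minimal sketch of such a sanity check, on a made-up DataFrame that reuses the column names from above: the group totals should add up to the total of the original column, and every row should land in exactly one group.

```python
import pandas as pd

# Made-up data with the same column names as above
toy = pd.DataFrame({
    'postcode_short': ['BS1', 'BS1', 'BS4', 'BS9'],
    'Acc': [1, 2, 1, 0],
})

grouped_toy = toy.groupby('postcode_short')[['Acc']].sum()

# The group totals should add up to the total of the original column
assert grouped_toy['Acc'].sum() == toy['Acc'].sum()

# Every row should belong to exactly one group
assert toy.groupby('postcode_short').size().sum() == len(toy)

print(grouped_toy)
```

If either assert fails, Python raises an AssertionError, which tells you that rows were lost or double-counted somewhere along the way.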