## Data Science Bootcamp or Data Analyst Apprenticeship - see Slack Channel

### Pandas ("panel data")

#### The pandas package is used to explore data, e.g.
 - Calculate statistical values (mean, max, min etc)
 - Clean data (dealing with outliers, missing data etc)
 - Manipulate data to get into an easier form to deal with
 - Plot data via Matplotlib
 - Feed data into machine learning packages, e.g. scikit-learn

#### The syntax is very similar to NumPy.  Pandas is already installed as part of the Anaconda installation; otherwise you would need to install it on your computer as a first step.

### What are pandas - series and dataframes

##### Example of a 'Series' and a 'DataFrame' below; we won't worry too much about creating them from scratch as, typically, you will use pandas for reading in data from a file or similar

In [None]:
import pandas as pd   #use pandas package and refer to it as 'pd' to save writing out

# A series is effectively a column of data

data_Series = [12, 8, 10, 11]

attendance = pd.Series(data_Series)

print(attendance)

##### We could make the labelling a bit more helpful

In [None]:
# Add some row labels

attendance_labelled = pd.Series(data_Series, index = ['week_1', 'week_2', 'week_3', 'week_4'])

attendance_labelled

###### A DataFrame is multiple Series, stuck together

In [None]:
import pandas as pd   

data_df = {'Python' : [12, 8, 10, 11], 'Web101' : [6, 10, 8, 11], 'AI' : [11, 7, 6, 9]}

profiles = pd.DataFrame(data_df, index = ['week_1', 'week_2', 'week_3', 'week_4'])

profiles
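Each column of a DataFrame is itself a Series, so the statistics mentioned at the start can be calculated column by column. A minimal sketch, reusing the made-up attendance figures from the cell above (the name `python_attendance` is purely illustrative):

```python
import pandas as pd

data_df = {'Python': [12, 8, 10, 11], 'Web101': [6, 10, 8, 11], 'AI': [11, 7, 6, 9]}
profiles = pd.DataFrame(data_df, index=['week_1', 'week_2', 'week_3', 'week_4'])

# Selecting a single column returns a Series that keeps the row labels
python_attendance = profiles['Python']
print(type(python_attendance))

# Statistics are calculated separately for each column
print(profiles.mean())   # mean attendance per course
print(profiles.max())    # highest weekly attendance per course

# .loc selects a row by its label
print(profiles.loc['week_2'])
```

Selecting by label with `.loc` is the reason for giving the rows meaningful names in the first place.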

### Reading in data

##### Let's use a local dataset - if only to promote the interesting datasets that are available.  This one was chosen because it is quite small and illustrates some of the features you will tend to find - many more interesting ones are available:  https://opendata.bristol.gov.uk/pages/homepage/

In [6]:
# Note that all the import statements should go at the top of a program
import pandas as pd
import os

input_file = os.path.join(os.path.abspath(''), 'datafiles', 'public-toilets-community.csv')

toilet_data = pd.read_csv(input_file)

toilet_df = pd.DataFrame(toilet_data) # read_csv already returns a DataFrame, so this wrapper is optional

toilet_df.head() # print 1st 5 rows by default.  Can use .head(n) to print 1st n rows, or tail(n) for last n rows

,OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
0,52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0.0,Yes,0.0,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
1,29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1.0,3,,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"
2,54,Avonmouth Medical Centre (Pioneer Medical Group),"1 Collins Street, Bristol",BS11 9JJ,Avonmouth & Lawrence Weston,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,0.0,0,2.0,...,N,Y during open times,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.694532865...","51.4987516075, -2.69453286599"
3,2081,Ridingleaze Medical Centre (Pioneer Medical Gr...,"Ridingleaze, Lawrence Weston",BS11 0QE,Avonmouth & Lawrence Weston,Monday- Friday 08:00 - 18:30,0,0.0,0,1.0,...,N,no,N,N,N,N,,,"{""type"": ""Point"", ""coordinates"": [-2.659817532...","51.5004455574, -2.65981753253"
4,57,Barton Hill Settlement,"41-43 Ducie Road, BS5 0AX",BS5 0AX,Lawrence Hill,Monday - Friday 9:00 - 16:00,2,,2,0.0,...,N,Yes- during open time,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.563674974...","51.4562494756, -2.5636749747"


##### We can see that there is an 'OBJECTID' which (from reading the documentation) I know is a unique identifier for each location.  Hence, it makes sense to use that as our row index - but first we need to check whether there are any duplicates in the column; if so, then it won't work as an index.  

In [7]:
test = toilet_df['OBJECTID'].duplicated().any() #returns True if any duplicates, False otherwise
print(test)

False


##### Good - now we can read in the DataFrame again, but using the first column as an index column:

In [8]:
toilet_data = pd.read_csv(input_file, index_col = [0]) # ie use first column as the index
toilet_df = pd.DataFrame(toilet_data)
toilet_df.head()

OBJECTID,Name,Address,Postcode,Ward,Toilet opening hours,Male (no. of),Urinals (no. of),Female (no. of),Unisex (no.of),Accessible (no. of),...,Automatic opening door,Pull cord monitored (Y/N/Times),Radar Key,Baby change,Family Toilet,Automatic Public Convenience,Attended,Full Time Staffing,geo_shape,geo_point_2d
52,Tesco - Callington Road,"Callington Road, BS4 5AY",BS4 5AY,Brislington West,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,0.0,Yes,0.0,1,...,N,Yes- during open times,N,Y,N,N,,N,"{""type"": ""Point"", ""coordinates"": [-2.554650233...","51.4324732718, -2.55465023322"
29,Sea Walls Public Toilets,Circular Road By Sea Walls viewing Area,BS9 1ZZ,Stoke Bishop,Monday - Saunday 7:30-18:00,2,1.0,3,,Yes,...,N,N,Y,Y,N,N,N,N,"{""type"": ""Point"", ""coordinates"": [-2.635545426...","51.4701425495, -2.63554542692"
54,Avonmouth Medical Centre (Pioneer Medical Group),"1 Collins Street, Bristol",BS11 9JJ,Avonmouth & Lawrence Weston,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,0.0,0,2.0,1,...,N,Y during open times,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.694532865...","51.4987516075, -2.69453286599"
2081,Ridingleaze Medical Centre (Pioneer Medical Gr...,"Ridingleaze, Lawrence Weston",BS11 0QE,Avonmouth & Lawrence Weston,Monday- Friday 08:00 - 18:30,0,0.0,0,1.0,1,...,N,no,N,N,N,N,,,"{""type"": ""Point"", ""coordinates"": [-2.659817532...","51.5004455574, -2.65981753253"
57,Barton Hill Settlement,"41-43 Ducie Road, BS5 0AX",BS5 0AX,Lawrence Hill,Monday - Friday 9:00 - 16:00,2,,2,0.0,2,...,N,Yes- during open time,N,Y,N,N,Y,Y,"{""type"": ""Point"", ""coordinates"": [-2.563674974...","51.4562494756, -2.5636749747"
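As an aside, re-reading the file is not the only route. The sketch below uses a small made-up frame (the rows and the names `small_df`, `dup_df` and `duplicate_rejected` are illustrative, not the real file) to show that `set_index` can move an existing column into the index, and that `verify_integrity=True` raises an error if that column contains duplicates.

```python
import pandas as pd

small_df = pd.DataFrame({
    'OBJECTID': [52, 29, 54],
    'Postcode': ['BS4 5AY', 'BS9 1ZZ', 'BS11 9JJ'],
})

# is_unique is the mirror image of .duplicated().any()
print(small_df['OBJECTID'].is_unique)

# Move the column into the index, checking for duplicates at the same time
indexed_df = small_df.set_index('OBJECTID', verify_integrity=True)
print(indexed_df.loc[29, 'Postcode'])

# A frame with a repeated identifier is rejected with a ValueError
dup_df = pd.DataFrame({'OBJECTID': [52, 52], 'Postcode': ['BS4 5AY', 'BS9 1ZZ']})
try:
    dup_df.set_index('OBJECTID', verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```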


##### That's all there is to it - the data is now in a DataFrame and ready to be examined.  Similarly, you can easily read in data from JSON files or SQL databases (for those who are familiar with them)
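A rough sketch of reading from JSON and from a SQL database, assuming a tiny made-up JSON string and an in-memory SQLite table called `toilets` (both invented for illustration; a real project would point at its own file or database):

```python
import sqlite3
from io import StringIO

import pandas as pd

# JSON: a list of records, wrapped in StringIO so it behaves like a file
json_text = '[{"OBJECTID": 52, "Postcode": "BS4 5AY"}, {"OBJECTID": 29, "Postcode": "BS9 1ZZ"}]'
json_df = pd.read_json(StringIO(json_text))
print(json_df)

# SQL: build a throwaway in-memory SQLite table, then query it into a DataFrame
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE toilets (OBJECTID INTEGER, Postcode TEXT)')
conn.executemany('INSERT INTO toilets VALUES (?, ?)', [(52, 'BS4 5AY'), (29, 'BS9 1ZZ')])
sql_df = pd.read_sql('SELECT * FROM toilets', conn, index_col='OBJECTID')
conn.close()
print(sql_df)
```

In both cases the result is an ordinary DataFrame, so everything that follows applies unchanged.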

### Examining the data

##### Looking at the first few rows wasn't very helpful - there is too much information, so it is not nicely laid out.  The first command that is normally run on a new dataset is .info()

In [9]:
toilet_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 22 columns):
Name                               107 non-null object
Address                            107 non-null object
Postcode                           107 non-null object
Ward                               107 non-null object
Toilet opening hours               105 non-null object
Male (no. of)                      82 non-null object
Urinals (no. of)                   66 non-null object
Female (no. of)                    82 non-null object
Unisex (no.of)                     51 non-null float64
Accessible (no. of)                107 non-null object
Accessibility info                 103 non-null object
Changing Place                     90 non-null object
Automatic opening door             75 non-null object
Pull cord monitored (Y/N/Times)    71 non-null object
Radar Key                          80 non-null object
Baby change                        86 non-null object
Family Toilet        
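The non-null counts reported by `.info()` can also be turned around to count the missing values directly. A small sketch on made-up rows (the frame `sample` is illustrative and only borrows a few column names from the dataset):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['BS4 5AY', 'BS9 1ZZ', 'BS11 9JJ'],
    'Unisex (no.of)': [0.0, np.nan, 2.0],
    'Female (no. of)': ['Yes', '3', np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# The data type pandas has chosen for each column
print(sample.dtypes)
```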

### Cleaning and manipulating the data

##### Let's assume that we want to look at the spread of publicly available toilets by postcode so we can see which areas need further work to encourage businesses to join the scheme.  We will want to know when they are open and whether there are accessible toilets by postcode.  The steps we will follow are:
 - Remove columns that are clearly not necessary
 - Look more closely at the data to see if any of it is clearly wrong or missing - decide what to do about it
 - Have a look at data by postcode
 - Save as a file so that hard work isn't lost!
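To give a feel for where these steps lead, here is a hedged sketch on a made-up four-row frame. The values are invented, and grouping on the first part of the postcode (stored here in a hypothetical `District` column) is only one possible way of looking "by postcode".

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['BS4 5AY', 'BS9 1ZZ', 'BS11 9JJ', 'BS11 0QE'],
    'Ward': ['Brislington West', 'Stoke Bishop',
             'Avonmouth & Lawrence Weston', 'Avonmouth & Lawrence Weston'],
    'Accessible (no. of)': [1, 1, 1, 2],
})

# Step 1: remove a column that is not needed
toy = toy.drop(columns=['Ward'])

# Step 3: take the part of the postcode before the space, then total by that area
toy['District'] = toy['Postcode'].str.split().str[0]
by_district = toy.groupby('District')['Accessible (no. of)'].sum()
print(by_district)
```

The cleaning step is skipped here because the invented values are already tidy; the real data is not so well behaved.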

In [10]:
# The drop() method can be used to drop a column

df_rdcd = toilet_df.drop("Ward", axis = 1) # axis = 1 means we are dropping column; use axis = 0 to drop rows
df_rdcd.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 21 columns):
Name                               107 non-null object
Address                            107 non-null object
Postcode                           107 non-null object
Toilet opening hours               105 non-null object
Male (no. of)                      82 non-null object
Urinals (no. of)                   66 non-null object
Female (no. of)                    82 non-null object
Unisex (no.of)                     51 non-null float64
Accessible (no. of)                107 non-null object
Accessibility info                 103 non-null object
Changing Place                     90 non-null object
Automatic opening door             75 non-null object
Pull cord monitored (Y/N/Times)    71 non-null object
Radar Key                          80 non-null object
Baby change                        86 non-null object
Family Toilet                      85 non-null object
Automatic Public Conve

##### Good - "Ward" no longer appears in our new dataframe.  However:
 - we'd really like to remove many columns at the same time, and
 - if we create a new dataframe every time we make a change then we will end up with many intermediate dataframes to keep track of and hold in memory.  It would be better to change the existing dataframe - so long as recreating it isn't a problem in the event of mistakes!!

In [11]:
# We'll do both of these steps together
# Put the column headings in a list, and pass inplace=True as an argument

df_rdcd.drop(["Changing Place", "Name","Address", "Automatic opening door", "Pull cord monitored (Y/N/Times)","Male (no. of)",
              "Urinals (no. of)","Radar Key","Unisex (no.of)","Accessibility info",
              "Baby change", "Family Toilet", "Automatic Public Convenience", "Attended", "Full Time Staffing", 
              "geo_shape", "geo_point_2d"], axis = 1, inplace = True)
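When far more columns are being dropped than kept, it can be easier to list the ones to keep instead. A minimal sketch on a made-up frame (`wide`, `keep` and `narrow` are illustrative names), also showing the `columns=` keyword as an alternative to `axis = 1`:

```python
import pandas as pd

wide = pd.DataFrame({
    'Name': ['A', 'B'],
    'Postcode': ['BS4 5AY', 'BS9 1ZZ'],
    'Toilet opening hours': ['7am - 10pm', 'Monday-Friday 08:00-18:30'],
    'Radar Key': ['N', 'Y'],
})

# Option 1: select the columns to keep (copy() gives an independent frame)
keep = ['Postcode', 'Toilet opening hours']
narrow = wide[keep].copy()

# Option 2: drop the unwanted columns, naming them via columns=
dropped = wide.drop(columns=['Name', 'Radar Key'])

print(narrow.columns.tolist())
print(narrow.equals(dropped))
```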



In [12]:
df_rdcd

OBJECTID,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,1
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,Yes
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2
...,...,...,...,...
641,BS11 0RX,Monday to Friday 09:00-16:00 with ad-hoc addit...,,1
1745,BS9 2HL,Monday-Friday 08:00-18:30,0,1
2417,BS4 1WH,08:30-18:30 Monday-Friday,2,1
70,BS14 0BE,7am - 10pm,3,1


##### I've been rather drastic - in real life I might have wanted to look a bit more carefully at some of the columns before deleting them, but it will do as an example.  I'll save a copy so that if I stop work and pick it up again later, I don't have to re-run all the previous cells.  (This is not particularly an issue for this example, but it could be if you are looking at the Air Quality dataset, which has several sensors sending hourly data for many years!)

In [13]:
# Similar code to reading in the file
import os

output_file = os.path.join(os.path.abspath(''), 'datafiles', 'dataset_acc.csv')
df_rdcd.to_csv(output_file)
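One detail worth knowing about this round trip: `to_csv` writes the index as the first column of the file, so it needs to be named as the index again when reading back. A small sketch using a temporary directory and made-up rows (the frame `original` and the `Accessible` column are illustrative):

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame(
    {'Postcode': ['BS4 5AY', 'BS9 1ZZ'], 'Accessible': [1, 2]},
    index=pd.Index([52, 29], name='OBJECTID'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dataset_acc.csv')
    original.to_csv(path)                       # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # first column becomes the index again
    naive = pd.read_csv(path)                   # without index_col it is an ordinary column

print(restored.index.name)
print(naive.columns.tolist())
```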

##### So now I'll re-import the pandas library and read the saved file back in, so that if I want to pick up my work at a later stage, I can start from here.

In [14]:
import pandas as pd
import os

input_filev1 = os.path.join(os.path.abspath(''), 'datafiles', 'dataset_acc.csv')

acc_data = pd.read_csv(input_filev1, index_col = [0])

acc_data.head()                            

OBJECTID,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
52,BS4 5AY,"Mon 8am then 24 hours until 10pm on Sats, Sun ...",Yes,1
29,BS9 1ZZ,Monday - Saunday 7:30-18:00,3,Yes
54,BS11 9JJ,"Monday-Thursday 08:00 - 18:30, Friday 08:00-17:00",0,1
2081,BS11 0QE,Monday- Friday 08:00 - 18:30,0,1
57,BS5 0AX,Monday - Friday 9:00 - 16:00,2,2


##### Let's have a look at some of the functions that are available in pandas

In [19]:
acc_data.describe()

,Postcode,Toilet opening hours,Female (no. of),Accessible (no. of)
count,107,105,82,107
unique,100,98,13,8
top,BS1 5RR,"Monday-Friday 09:00-16:00, weekends and bank h...",2,1
freq,2,3,19,78


##### top is the most frequent value in that column.  freq is the frequency with which the 'top' value appears (e.g. 2 is the most frequent number of female toilets available and there are 19 locations with this number)

##### I am a little surprised, though, that more characteristics aren't shown as a result of the describe() function - I'd normally expect to see the mean, 25th percentile, min and max for numeric columns.  However, I can see from the .info() output that some of the columns are of type 'object', which is a sign that they contain strings or mixed types, e.g. integers and strings.  Let's have a closer look.
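To illustrate the difference, here is a sketch with two made-up Series holding similar values, one stored as numbers and one as strings (`numeric_col` and `text_col` are illustrative names):

```python
import pandas as pd

numeric_col = pd.Series([2, 0, 3, 1, 2])
text_col = pd.Series(['2', '0', 'Yes', '1', '2'])

# Numeric data: count, mean, std, min, quartiles and max
numeric_summary = numeric_col.describe()
print(numeric_summary)

# Text data: only count, unique, top and freq
text_summary = text_col.describe()
print(text_summary)
```

A single non-numeric entry such as 'Yes' is enough to make pandas treat the whole column as text.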

In [23]:
acc_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 52 to 5153
Data columns (total 4 columns):
Postcode                107 non-null object
Toilet opening hours    105 non-null object
Female (no. of)         82 non-null object
Accessible (no. of)     107 non-null object
dtypes: object(4)
memory usage: 4.2+ KB


In [24]:
acc_data['Female (no. of)'].unique()

array(['Yes', '3', '0', '2', '4', nan, '5', '1', '6', '19', '7', '9',
       'No', '24'], dtype=object)

##### Even better, we can see the frequency with which each of these occurs, as follows

In [33]:
acc_data['Female (no. of)'].value_counts()

2      19
0      15
3      14
1      13
5       5
4       5
6       3
7       2
9       2
Yes     1
No      1
24      1
19      1
Name: Female (no. of), dtype: int64

##### So, judgement is now required.  'No' can probably be best replaced by 0. 'Yes' means that there is at least one - if we had more time to look at the other columns we might be able to make an informed guess.  For now, maybe just be conservative and assume that 'Yes' can be replaced by 1.

##### The .map(XYZ) method applies whatever is defined by XYZ to every value in a pandas Series.  Let's use this to convert all the 'Yes' and 'No' values.

In [48]:
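def yes_no(x):
    """Convert 'Yes'/'No' to 1/0 and numeric strings to int; leave other values (eg NaN) unchanged."""
    if x == 'Yes':
        return 1
    if x == 'No':
        return 0
    if isinstance(x, str):
        return int(x)
    return x

acc_data['Female'] = acc_data['Female (no. of)'].map(yes_no) # changes 'Yes' to 1, 'No' to 0 and '3' to 3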

In [49]:

acc_data['Female'].value_counts() # now let's check by looking at the frequency

2.0     19
0.0     16
3.0     14
1.0     14
5.0      5
4.0      5
6.0      3
9.0      2
7.0      2
24.0     1
19.0     1
Name: Female, dtype: int64
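The values are shown as 2.0, 0.0 and so on because the missing entries force the new column to a floating-point type. As an alternative sketch of the same cleaning step (not the approach used above), `replace` and `pd.to_numeric` can do the conversion, and the nullable `Int64` type keeps whole numbers alongside missing values. The Series `raw` below is made up for illustration:

```python
import numpy as np
import pandas as pd

raw = pd.Series(['Yes', '3', 'No', np.nan, '24'], name='Female (no. of)')

# Swap the words for digits, still as strings
as_digits = raw.replace({'Yes': '1', 'No': '0'})

# Convert to numbers; anything unparseable would become NaN
as_numbers = pd.to_numeric(as_digits, errors='coerce')

# Nullable integer type: whole numbers plus <NA> for missing entries
cleaned = as_numbers.astype('Int64')
print(cleaned)
print(cleaned.value_counts())
```

The same assumption applies as before: treating 'Yes' as 1 is a conservative guess, not something the data confirms.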