## Data Science Bootcamp or Data Analyst Apprenticeship - see Slack Channel

### Pandas ("panel data")

#### The pandas package is used to explore data, e.g.
 - Calculate statistical values (mean, max, min etc)
 - Clean data (dealing with outliers, missing data etc)
 - Manipulate data to get into an easier form to deal with
 - Plot data via Matplotlib
 - Feed data into machine learning packages, eg Scikit-learn

#### The syntax is very similar to NumPy.  Pandas is already installed as part of the Anaconda installation - otherwise you would need to install it on your computer as a first step.

### What are pandas - series and dataframes

##### Example of a 'Series' and 'DataFrame' below; we won't worry too much about creating from scratch as, typically, you will use pandas for reading in data from a file or similar

In [None]:
import pandas as pd   # use pandas package and refer to it as 'pd' to save writing out

# A series is effectively a column of data

data_Series = [12, 8, 10, 11]

attendance = pd.Series(data_Series)

print(attendance)

##### We could make the labelling a bit more helpful

In [None]:
# Add some row labels

attendance_labelled = pd.Series(data_Series, index = ['week_1', 'week_2', 'week_3', 'week_4'])

attendance_labelled
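
##### One benefit of row labels is that values can be looked up by name rather than by position.  A minimal sketch, reusing the illustrative attendance figures from above (the names week_2_attendance and first_three_weeks are just for this example):

```python
import pandas as pd

# Same illustrative attendance figures as in the cells above
attendance_labelled = pd.Series(
    [12, 8, 10, 11],
    index=['week_1', 'week_2', 'week_3', 'week_4'],
)

# Look up a single value by its row label
week_2_attendance = attendance_labelled['week_2']
print(week_2_attendance)

# .loc with a label slice includes both end points
first_three_weeks = attendance_labelled.loc['week_1':'week_3']
print(first_three_weeks)
```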

###### A DataFrame is multiple Series, stuck together

In [None]:
import pandas as pd   

data_df = {'Python' : [12, 8, 10, 11], 'Web101' : [6, 10, 8, 11], 'AI' : [11, 7, 6, 9]}

profiles = pd.DataFrame(data_df, index = ['week_1', 'week_2', 'week_3', 'week_4'])

profiles
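
##### To see the "multiple Series stuck together" idea in practice, the sketch below pulls one column back out of the same DataFrame and checks that it is a Series.  It also takes the mean of each column, purely as an illustration.

```python
import pandas as pd

profiles = pd.DataFrame(
    {'Python': [12, 8, 10, 11], 'Web101': [6, 10, 8, 11], 'AI': [11, 7, 6, 9]},
    index=['week_1', 'week_2', 'week_3', 'week_4'],
)

# Selecting a single column gives back a Series with the same row labels
python_attendance = profiles['Python']
print(type(python_attendance))

# Statistical functions work column by column on a DataFrame
column_means = profiles.mean()
print(column_means)
```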

### Reading in data

##### Let's use a local dataset - if only to promote the interesting datasets that are available.  This one was chosen because it is quite small and illustrates some of the features you will tend to find.  Many more interesting ones are available:  https://opendata.bristol.gov.uk/pages/homepage/

In [None]:
# Note that all the import statements should go at the top of a programme
import pandas as pd
import os

input_file = os.path.join(os.path.abspath(''), 'datafiles', 'public-toilets-community.csv')

toilet_df = pd.read_csv(input_file) # read_csv already returns a DataFrame, so no further conversion is needed

toilet_df.head() # print 1st 5 rows by default.  Can use .head(n) to print 1st n rows, or tail(n) for last n rows

##### We can see that there is an 'OBJECTID' which (from reading the documentation) I know is a unique identifier for each location.  Hence, it makes sense to use that as our row index - but first we need to check whether there are any duplicates in the column; if so, then it won't work as an index.  

In [None]:
test = toilet_df['OBJECTID'].duplicated().any() # returns True if any duplicates, False otherwise
print(test)
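
##### To show what this check looks like when duplicates do exist, here is a small sketch on made-up data (the toy DataFrame and its values are hypothetical, not taken from the toilets dataset).  It also shows .is_unique, which is another way of asking the same question.

```python
import pandas as pd

# Hypothetical toy data: the ID 2 appears twice
toy = pd.DataFrame({'OBJECTID': [1, 2, 2, 3], 'Name': ['A', 'B', 'C', 'D']})

# True if any value in the column is repeated
has_duplicates = toy['OBJECTID'].duplicated().any()
print(has_duplicates)

# keep=False marks every row that shares a repeated value, so we can inspect them
repeated_rows = toy[toy['OBJECTID'].duplicated(keep=False)]
print(repeated_rows)

# .is_unique is the opposite test: True only when there are no repeats
ids_are_unique = toy['OBJECTID'].is_unique
print(ids_are_unique)
```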

##### Good - assuming the check printed False (ie no duplicates), we can now read in the data again, but using the first column as an index column:

In [None]:
toilet_df = pd.read_csv(input_file, index_col = 0) # ie use first column as the index; read_csv already returns a DataFrame
toilet_df.head()

##### That's all there is to it - the data is now in a DataFrame and ready to be examined.  Similarly, pandas can easily read in data from JSON files or SQL databases (for those who are familiar with them).
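
##### Reading from JSON or SQL follows the same pattern as reading a CSV.  The sketch below uses made-up records held in memory, so the JSON text, the table name 'toilets' and its contents are all hypothetical.  With real data you would pass a file path or a connection to your own database instead.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical JSON records, wrapped in StringIO so no file is needed
json_text = '[{"OBJECTID": 1, "Name": "Library"}, {"OBJECTID": 2, "Name": "Museum"}]'
json_df = pd.read_json(io.StringIO(json_text))
print(json_df)

# Hypothetical in-memory SQLite database with a single table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE toilets (OBJECTID INTEGER, Name TEXT)')
conn.executemany('INSERT INTO toilets VALUES (?, ?)', [(1, 'Library'), (2, 'Museum')])
conn.commit()

# Run a query and use the OBJECTID column as the row index
sql_df = pd.read_sql('SELECT * FROM toilets', conn, index_col='OBJECTID')
conn.close()
print(sql_df)
```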

### Examining the data

##### Looking at the first few rows wasn't very helpful - there is too much information, so it is not nicely laid out.  The first command that is normally run on a new dataset is .info()

In [None]:
toilet_df.info()

### Cleaning and manipulating the data

##### Let's assume that we want to look at the spread of publicly available toilets by postcode, so that we can see which areas need further work to encourage businesses to join the scheme.  We will want to know when they are open and whether there are accessible toilets, by postcode.  The steps we will follow are:
 - Remove columns that are clearly not necessary
 - Look more closely at the data to see whether any of it is clearly wrong or missing - and decide what to do about it
 - Have a look at data by postcode
 - Save as a file so that hard work isn't lost!
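
##### As a rough preview of the "by postcode" step, the sketch below groups some made-up rows by postcode.  The column names 'Postcode' and 'Accessible' and all the values are assumptions for illustration only; the real dataset may name and encode these differently.

```python
import pandas as pd

# Hypothetical toy data; 'Postcode' and 'Accessible' are assumed column names
toy = pd.DataFrame({
    'Postcode': ['BS1', 'BS1', 'BS2', 'BS3'],
    'Accessible': [1, 0, 1, 0],
})

# For each postcode: number of locations ('count') and
# number of those with an accessible toilet ('sum')
by_postcode = toy.groupby('Postcode')['Accessible'].agg(['count', 'sum'])
print(by_postcode)
```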

In [None]:
# the drop() function can be used to drop a column

df_rdcd = toilet_df.drop("Ward", axis = 1) # axis = 1 means we are dropping column; use axis = 0 to drop rows
df_rdcd.info()


##### Good - "Ward" no longer appears in our new dataframe.  However:
 - we'd really like to remove many columns at the same time, and
 - if we create a new dataframe every time we make a change then we will end up using lots of memory.  It would be better to change the existing dataframe - so long as recreating it isn't a problem in the event of mistakes!

In [None]:
# We'll do both of these steps together
# Put column headings in a list, and use inplace = True as an argument

df_rdcd.drop(["Changing Place", "Name","Address", "Automatic opening door", "Pull cord monitored (Y/N/Times)","Male (no. of)",
              "Urinals (no. of)","Radar Key","Unisex (no.of)","Accessibility info",
              "Baby change", "Family Toilet", "Automatic Public Convenience", "Attended", "Full Time Staffing", 
              "geo_shape", "geo_point_2d"], axis = 1, inplace = True)



In [None]:
df_rdcd
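
##### As an aside, drop() also accepts a columns keyword, which some people find easier to read than axis = 1.  A minimal sketch on a made-up DataFrame (the toy data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with three columns
toy = pd.DataFrame({'Name': ['A', 'B'], 'Ward': ['X', 'Y'], 'Postcode': ['BS1', 'BS2']})

# columns= avoids having to remember which axis number means columns
trimmed = toy.drop(columns=['Name', 'Ward'])
print(trimmed.columns.tolist())

# errors='ignore' skips labels that are not present instead of raising a KeyError,
# which is handy when a cell might be re-run after the columns have already gone
trimmed_again = trimmed.drop(columns=['Ward'], errors='ignore')
print(trimmed_again.columns.tolist())
```

##### Without inplace = True the original DataFrame (toy here) is left untouched, so a mistake is easy to undo.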

##### I've been rather drastic - in real life I might have wanted to look a bit more carefully at some of the columns before deleting them, but it will do as an example.  I might just save a copy so that if I stop work and pick it up again later, I don't have to re-run all the previous cells (not particularly an issue for this example, but it could be if you are looking at the Air Quality dataset, which has several sensors sending hourly data for many years!)

In [None]:
# Similar code to reading in the file
import os

output_file = os.path.join(os.path.abspath(''), 'datafiles', 'dataset_acc.csv')
df_rdcd.to_csv(output_file)

##### So now I'll re-import the pandas library and read the saved file back in, so that if I want to pick up my work at a later stage, then I can start from here.

In [None]:
import pandas as pd
import os

input_filev1 = os.path.join(os.path.abspath(''), 'datafiles', 'dataset_acc.csv')

acc_data = pd.read_csv(input_filev1, index_col = [0])

acc_data.head()
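
##### The reason for index_col when reading back is that to_csv writes the row index as the first column of the file.  The sketch below shows the round trip on made-up data, using a temporary folder so nothing is left on disk (the toy DataFrame and the file name are hypothetical).

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with a named index, like OBJECTID in the real file
toy = pd.DataFrame(
    {'Postcode': ['BS1', 'BS2']},
    index=pd.Index([101, 102], name='OBJECTID'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(path)                            # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # read it back in as the index
    without_index = pd.read_csv(path)           # otherwise it is just an ordinary column

print(restored)
print(without_index.columns.tolist())
```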

##### Let's have a look at some of the functions that are available in pandas

In [None]:
acc_data.describe()

##### top is the most frequent value in that column.  freq is the frequency with which the 'top' value appears (eg, 2 is the most frequent number of female toilets available and there are 19 locations with this number)

##### However, I am a little surprised that more characteristics aren't shown as a result of the describe() function - I'd normally expect to see the mean, 25th percentile, min and max for numeric columns.  I can see from the .info() output that some of the columns are of type 'object', which is often a sign that they contain text or mixed types, eg, integers and strings.  Let's have a closer look.
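
##### The difference is easy to reproduce on made-up data.  In this sketch (the values are hypothetical, and dtype=object is set explicitly to mimic what .info() reports), describe() gives count, unique, top and freq for the text column, but the mean, quartiles, min and max for the numeric one.

```python
import pandas as pd

# Hypothetical values mimicking a count column that also contains 'Yes' and 'No'
mixed = pd.Series(['2', '1', 'Yes', 'No', '2'], dtype=object)
numeric = pd.Series([2, 1, 1, 0, 2])

mixed_summary = mixed.describe()      # count, unique, top, freq
numeric_summary = numeric.describe()  # count, mean, std, min, quartiles, max

print(mixed.dtype)
print(mixed_summary)
print(numeric_summary)
```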

In [None]:
acc_data.info()

In [None]:
acc_data['Female (no. of)'].unique()

##### Even better, we can see the frequency with which each of these occurs, as follows

In [None]:
acc_data['Female (no. of)'].value_counts()

##### So, judgement is now required.  'No' can probably be best replaced by 0. 'Yes' means that there is at least one - if we had more time to look at the other columns we might be able to make an informed guess.  For now, maybe just be conservative and assume that 'Yes' can be replaced by 1.

##### The .map(XYZ) function applies whatever is defined by XYZ to every value in a pandas Series.  Let's use this to convert all the 'Yes' and 'No' values.

In [None]:
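def yes_no(x):
    """Convert 'Yes'/'No' to 1/0 and numeric strings to int; leave anything else unchanged."""
    if x == 'Yes':
        return 1
    if x == 'No':
        return 0
    if isinstance(x, str):
        return int(x) # raises ValueError if the string is not a whole number
    return x

# changes 'Yes' to 1 and 'No' to 0, and converts numeric strings to integers
acc_data['Female'] = acc_data['Female (no. of)'].map(yes_no)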

In [None]:

acc_data['Female'].value_counts() # now let's check by looking at the frequency
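
##### A possible alternative to writing our own function is to combine .replace() with pd.to_numeric().  This is only a sketch on made-up values (the raw Series below is hypothetical): errors='coerce' turns anything that still isn't a number into NaN rather than raising an error, which may or may not be what you want.

```python
import pandas as pd

# Hypothetical raw values, including one unexpected entry
raw = pd.Series(['2', 'Yes', 'No', '1', 'unknown'], dtype=object)

# Swap the words for digits first, then convert the whole column to numbers
swapped = raw.replace({'Yes': '1', 'No': '0'})
cleaned = pd.to_numeric(swapped, errors='coerce')

print(cleaned)
print(cleaned.isna().sum())  # how many values could not be converted
```

##### Because NaN is a floating point value, the result here is a float column rather than an integer one.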