<a href="https://colab.research.google.com/github/Doongka/GHDColabExamples/blob/master/Visualising_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Why use Python instead of Excel for data analysis?
The water sector is a data-rich environment, and the amount of data being collected is steadily increasing.  This gives us new opportunities to exploit this data in our day-to-day work.

Even though Excel is a powerful tool, it does have some limitations.  The most notable are poor performance when a spreadsheet is full of formulas and the limit on the number of rows and columns. When our data is too big for a spreadsheet, or we require lots of calculations, we can create a coded solution using a language like Python.

So what makes Python so attractive for data analysis?

1.   Readable and easily maintainable.
2.   Very easy to learn and use.
3.   Open source (FREE!) and feature rich, with lots of libraries available.
4.   Cross-platform support, i.e. code written on a Mac can run on Windows (mostly).
5.   Most importantly, it's better suited to big data applications (not the best option, but the easiest).



# Data Visualization Example
The goal of this notebook is to provide a quick overview of some useful Python tools used for cleaning, analysing and visualizing data.

The notebook has been created using Google Colab, a free cloud-based notebook environment that allows you to write and execute Python code without needing to set up your own local Python environment.

Anyone can use this notebook, and I'll send out a link by email.  It's sitting in my online code repository, located here:-

https://colab.research.google.com/github/Doongka/GHDColabExamples/blob/master/Visualising_Data.ipynb

This notebook will allow you to run the code in segments as you go.  When you get to a code block, you can execute (run) the code by pressing CTRL-ENTER.  

Try it on the code block below:-

In [16]:
# assign "Hello World!" to a new variable "greeting"
greeting = "Hello World!"

# run the "print" function with "greeting" as an argument. i.e. Print to screen whatever is stored in "greeting"
print(greeting)

Hello World!


If successful you will see the words "Hello World!" output below the code.

Something important to note is that a # symbol at the start of a line causes Python to ignore that line. This is useful for describing your code, like I've done above, or for stopping lines of code from running while debugging. A quick way to "comment" and "uncomment" your code is to press CTRL-/.



## Prepare the workspace
Google Colab has many useful Python packages preinstalled, so there shouldn't be any need to install them yourself.  Packages are basically collections of useful functions.  In case you do need something that is not installed, you can run a "pip" installation as shown below.

You can see that I've "commented" the bit of code that installs the package "altair".  To install it, uncomment this line and run the code block.


In [0]:
# We place a ! in front of the command to indicate that we want this to run in a console.
# !pip install altair
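
If you're not sure whether a package is already available, you can check before installing it. The sketch below uses only Python's standard library; `is_installed` is a helper name made up for this example, not something provided by Colab.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# "json" ships with Python itself, so it should always be found
print(is_installed("json"))

# Swap in the package you care about, e.g. "altair"
print(is_installed("altair"))
```

If the check comes back False, that's when the pip line above is worth uncommenting.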

## Uploading your data to Colab
The first step in any data analysis is to prepare the data.  As we are running this in Google Colab, we need to upload any datasets.  There are several ways to do this but for this example I've used the most intuitive. 

I've downloaded some temperature data from BOM to try out. You can download the test data here:-

https://raw.githubusercontent.com/Doongka/GHDColabExamples/master/dataset/brisbanetemp.csv

After running the cell below, you will see a "Choose Files" button that lets you select the files you wish to upload.  This example has been designed to handle a single CSV file, so point it to where you saved the "brisbanetemp.csv" file on your local machine.




In [3]:
from google.colab import files

uploaded = files.upload()

Saving brisbanetemp.csv to brisbanetemp.csv


Running the next cell will give you an overview of the files you have just uploaded.  This is a handy test to make sure it uploaded correctly.

In [4]:
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "brisbanetemp.csv" with length 590942 bytes
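
The upload step hands back a dictionary that maps each file name to the file's raw bytes, which is why `len()` reports a size in bytes above. As a rough sketch, those bytes can be read by Pandas directly without going through a saved file. The dictionary below is a made-up stand-in for `uploaded` (the file name and values are illustrative only) so the example can run anywhere.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload(): file name -> raw bytes.
# The contents here are made up for illustration.
uploaded_example = {
    "example.csv": b"date,max_temp\n4/06/1949,14.0\n5/06/1949,18.2\n"
}

# Wrap the bytes in a file-like object so Pandas can parse them
preview = pd.read_csv(io.BytesIO(uploaded_example["example.csv"]))
print(preview)

# Pandas can also read straight from a URL, which skips the upload step entirely:
# data = pd.read_csv("https://raw.githubusercontent.com/Doongka/GHDColabExamples/master/dataset/brisbanetemp.csv")
```

Reading from the URL is handy when the data is already hosted somewhere; the upload route is the one to use for files that only exist on your machine.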


## Wrangling your data

Python has many pre-built packages that help with any extract, transform and load (ETL) operations required to get the data ready.

Pandas is a popular library that provides many useful methods for data manipulation. We will be using its "DataFrame" functionality, which allows us to import data from a CSV file and perform several data wrangling and cleansing tasks.



In [0]:
#These commands allow us to load the libraries in to our current notebook
import pandas as pd

## Loading your data
Before we can do anything with our data, we first need to load it.  Pandas has some great methods to read files and load them into a useful format. 

In the code block below, we are loading a CSV file and transforming it into a DataFrame object.  A DataFrame is effectively Pandas's answer to storing data in a tabular format.

In [6]:
# read the csv and load the contents into memory as a dataframe
data = pd.read_csv('brisbanetemp.csv')

# the head() method allows us to view the first 5 entries in the dataframe
data.head()

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC),site number,site name
0,,,,40842.0,BRISBANE AERO
1,4/06/1949,14.0,2.4,,
2,5/06/1949,18.2,10.0,,
3,6/06/1949,21.0,5.6,,
4,7/06/1949,20.5,7.1,,


## Cleaning up your data
You can see from the table above that row 0 holds only the site number and site name.  Its date and temperature cells are empty, which Pandas shows as "NaN".  "NaN" means "Not a Number" and in this case was caused by empty cells in the first row of the data.  Let's remove this row entry using the following command.

In [7]:
# Drop the first row from the dataframe and assign to a new Dataframe called "cleansed_data"
cleansed_data = data.drop(data.index[0])

# Let's have a look to see if it worked
cleansed_data.head()

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC),site number,site name
1,4/06/1949,14.0,2.4,,
2,5/06/1949,18.2,10.0,,
3,6/06/1949,21.0,5.6,,
4,7/06/1949,20.5,7.1,,
5,8/06/1949,20.5,6.1,,


We don't need the columns "site number" or "site name", so we'll drop these too.

In [8]:
# drop the columns 'site number' and 'site name'
cleansed_data = cleansed_data.drop(['site number', 'site name'], axis=1)

# check to see what we're left with
cleansed_data.head()

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC)
1,4/06/1949,14.0,2.4
2,5/06/1949,18.2,10.0
3,6/06/1949,21.0,5.6
4,7/06/1949,20.5,7.1
5,8/06/1949,20.5,6.1


Looking good but we should check the rest of the data.  Running the "info" method gives us information on the type of data stored in each column and also the number of non-null (not empty or NaN) entries. 

In [9]:
cleansed_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25564 entries, 1 to 25564
Data columns (total 3 columns):
date                          25564 non-null object
maximum temperature (degC)    25537 non-null float64
minimum temperature (degC)    25525 non-null float64
dtypes: float64(2), object(1)
memory usage: 798.9+ KB


We can see that the number of non-null entries is not the same for each column.  This means there are null entries sitting in our table.  We can search for them by doing the following:

In [10]:
# Find the null values in the maximum temperature column
missing_data = cleansed_data['maximum temperature (degC)'].isna()

# Find the null values in the minimum temperature column.  We are using a logical OR (|) to get a combined list of null entries
missing_data = missing_data | cleansed_data['minimum temperature (degC)'].isna()

# Count the number of rows that have null entries
missing_data.sum()

63

We have 63 rows that contain at least one null value.  Let's take a look:-

In [11]:
# Show only the rows that contain the null values.  Note the format: the first argument selects the rows we want, the second selects the columns.
# A colon (:) means we want to show all columns
cleansed_data.loc[missing_data,:]

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC)
623,16/02/1951,,17.9
624,17/02/1951,,17.6
625,18/02/1951,,16.4
626,19/02/1951,,20.7
627,20/02/1951,,21.6
...,...,...,...
10288,3/08/1977,21.6,
10822,19/01/1979,,22.4
11223,24/02/1980,29.4,
14000,2/10/1987,23.2,
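
The null-hunting steps above can be tried out on a tiny table where the answer is easy to check by eye. This is only a sketch: the values are made up, and `toy`, `has_null` and `subset` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Made-up values, used only to illustrate the calls
toy = pd.DataFrame({
    "date": ["1/01/2000", "2/01/2000", "3/01/2000", "4/01/2000"],
    "max_temp": [25.0, np.nan, 27.5, 26.0],
    "min_temp": [15.0, 14.5, np.nan, 13.0],
})

# Number of null entries in each column
null_counts = toy.isna().sum()

# True for every row that has a null in any of the listed columns
has_null = toy[["max_temp", "min_temp"]].isna().any(axis=1)

# .loc[rows, columns]: keep the flagged rows and two chosen columns
subset = toy.loc[has_null, ["date", "max_temp"]]

print(null_counts)
print(subset)
```

Using `.any(axis=1)` does the same job as chaining the columns together with `|`, which gets more convenient as the number of columns grows.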


We have several options at our disposal for handling the null entries. We can:

*   delete the rows, although that's generally not a good idea for time series data
*   back fill or forward fill each entry using the values around it
*   interpolate between values

Let's interpolate.



In [12]:
# We can remove all of the rows that have NaNs.  Probably not a good idea for a time series plot but necessary if the data can't be imputed
# final_data = cleansed_data.dropna(how='any')

# We can back fill each gap with the next valid value
# (fillna(method='bfill') is deprecated in recent versions of Pandas, so bfill() is used here)
# final_data = cleansed_data.bfill()
# Or forward fill each gap with the previous valid value
# final_data = cleansed_data.ffill()

# We can interpolate 
final_data = cleansed_data.interpolate(method='linear')
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25564 entries, 1 to 25564
Data columns (total 3 columns):
date                          25564 non-null object
maximum temperature (degC)    25564 non-null float64
minimum temperature (degC)    25564 non-null float64
dtypes: float64(2), object(1)
memory usage: 798.9+ KB
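
To get a feel for how the filling options differ, it can help to run them side by side on a short series with a gap in the middle. The numbers below are made up for illustration, and `gappy` and `comparison` are names chosen for this sketch.

```python
import numpy as np
import pandas as pd

# A short made-up series with a two-value gap in the middle
gappy = pd.Series([1.0, np.nan, np.nan, 4.0])

comparison = pd.DataFrame({
    "original": gappy,
    "ffill": gappy.ffill(),                        # repeat the previous valid value
    "bfill": gappy.bfill(),                        # repeat the next valid value
    "linear": gappy.interpolate(method="linear"),  # straight line across the gap
})
print(comparison)
```

For slowly varying data like daily temperatures, linear interpolation tends to give a smoother result than repeating a neighbouring value, which is why it was chosen here.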


We can see that we now have an equal number of entries in each column.  The last thing we need to do is make sure the date is in a usable format for the graph.  We can do this using the "to_datetime" function in Pandas.

In [13]:
# Fix the date column so it is in a usable format
final_data['date'] = pd.to_datetime(final_data['date'],format="%d/%m/%Y")

# Rename the columns so they're easier to read
final_data = final_data.rename(columns={"maximum temperature (degC)": "max_temp", "minimum temperature (degC)": "min_temp"})

# take a final look at the data
final_data.tail()

Unnamed: 0,date,max_temp,min_temp
25560,2019-05-27,27.1,11.8
25561,2019-05-28,21.6,9.5
25562,2019-05-29,22.6,7.5
25563,2019-05-30,22.5,8.5
25564,2019-05-31,19.7,5.2
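
The explicit format string matters because the dates in this file are written day first. Without it, a string like "4/06/1949" is ambiguous and may be read as the 6th of April rather than the 4th of June. A minimal sketch using a few sample strings in the same layout (`raw` and `parsed` are names chosen for this example):

```python
import pandas as pd

# A few sample strings in the same day/month/year layout as the dataset
raw = pd.Series(["4/06/1949", "5/06/1949", "31/05/2019"])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed)
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
```

Once the column holds real dates, the `.dt` accessor gives access to parts such as the day, month and year, which is useful for grouping and filtering.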


We can also get a summary of key statistics by using the .describe() method.


In [15]:
final_data.describe()

Unnamed: 0,max_temp,min_temp
count,25564.0,25564.0
mean,25.139225,15.362649
std,3.643943,5.21601
min,9.7,-1.8
25%,22.3,11.6
50%,25.4,16.1
75%,28.0,19.5
max,40.2,28.1


## Visualizing your Data


Altair is a great data visualization library that is preinstalled in Colab.  It is a statistical visualization language with lots of plot types and statistical functions.



In [0]:
# load up the Altair package to give us access to advanced visualization methods
import altair as alt

# Altair has a default 5000 entry limit.  This can be turned off, but we'll keep it on for this example and plot a date range that stays under it
start_date = '2008-01-01'
end_date =  '2018-12-31'

# We need to create a mask to pull out the data we want.  This basically flags the rows that sit between the two dates
# (> leaves out the start date itself, while <= keeps the end date)
mask = (final_data['date'] > start_date) & (final_data['date'] <= end_date)

# We'll create a new dataframe to store our plot data.  The .loc[mask] method returns the entries based on the indexes we obtained above
plot_data = final_data.loc[mask]
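
Before plotting, it's worth checking that the selection really does fit under the 5000 entry limit. The sketch below rebuilds the same mask on a stand-in table with one row per day over the period covered by the dataset (an assumption based on the row count and date range shown earlier), so it runs without the CSV file. `dates`, `n_rows` and `inclusive_mask` are names chosen for this example.

```python
import pandas as pd

# Stand-in for final_data: one row per day over the same period as the dataset
dates = pd.DataFrame({"date": pd.date_range("1949-06-04", "2019-05-31", freq="D")})

start_date = "2008-01-01"
end_date = "2018-12-31"

# Same mask as above: after the start date, up to and including the end date
mask = (dates["date"] > start_date) & (dates["date"] <= end_date)
n_rows = int(mask.sum())
print(n_rows, n_rows <= 5000)

# between() keeps both end points by default, so it also includes the start date
inclusive_mask = dates["date"].between(start_date, end_date)
print(int(inclusive_mask.sum()))
```

With the real data, `len(plot_data)` gives the same kind of check in one line.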




In [0]:
# Producing a graph in Altair takes a bit of code.  The code below produces an interactive scatterplot that lets us zoom in and out
alt.Chart(plot_data).mark_circle().encode(
    alt.X('date:T',
          scale=alt.Scale(zero=False)
    ),
    alt.Y('max_temp:Q',
          scale=alt.Scale(zero=False)
    ),
    tooltip=['date', 'max_temp', 'min_temp'],
    color=alt.Color('max_temp:Q', sort='descending', scale=alt.Scale(scheme=alt.SchemeParams(name='redyellowblue')))
).interactive()

In [0]:
# This is a more complex example.  This plot allows us to select a region of data and produce a histogram of the selection
brush = alt.selection_interval()

points = alt.Chart(plot_data).mark_point().encode(
    alt.X('date:T',
          scale=alt.Scale(zero=False)
    ),
    alt.Y('max_temp:Q',
          scale=alt.Scale(zero=False)
    ),
    color=alt.condition(brush, 'max_temp:Q', alt.value('lightgray'),sort='descending', scale=alt.Scale(scheme=alt.SchemeParams(name='redyellowblue')) ),
).add_selection(
    brush
)

bars = alt.Chart(plot_data).mark_bar().encode(
    alt.Y('count()',scale=alt.Scale(domain=[0, 60])),
    alt.X('max_temp:Q', scale=alt.Scale(domain=[0, 40])),
    color='max_temp:Q'
).transform_filter(
    brush
)

points & bars

In [0]:
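# This last example uses a sample dataset that ships with the vega_datasets package
import altair as alt
from vega_datasets import data as vega_data

# Imported under the name "vega_data" so it doesn't overwrite the "data" dataframe loaded earlier
source = vega_data.seattle_weather()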

scale = alt.Scale(domain=['sun', 'fog', 'drizzle', 'rain', 'snow'],
                  range=['#e7ba52', '#a7a7a7', '#aec7e8', '#1f77b4', '#9467bd'])
color = alt.Color('weather:N', scale=scale)

# We create two selections:
# - a brush that is active on the top panel
# - a multi-click that is active on the bottom panel
brush = alt.selection_interval(encodings=['x'])
click = alt.selection_multi(encodings=['color'])

# Top panel is scatter plot of temperature vs time
points = alt.Chart().mark_point().encode(
    alt.X('monthdate(date):T', title='Date'),
    alt.Y('temp_max:Q',
        title='Maximum Daily Temperature (C)',
        scale=alt.Scale(domain=[-5, 40])
    ),
    color=alt.condition(brush, color, alt.value('lightgray')),
    size=alt.Size('precipitation:Q', scale=alt.Scale(range=[5, 200]))
).properties(
    width=550,
    height=300
).add_selection(
    brush
).transform_filter(
    click
)

# Bottom panel is a bar chart of weather type
bars = alt.Chart().mark_bar().encode(
    x='count()',
    y='weather:N',
    color=alt.condition(click, color, alt.value('lightgray')),
).transform_filter(
    brush
).properties(
    width=550,
).add_selection(
    click
)

alt.vconcat(
    points,
    bars,
    data=source,
    title="Seattle Weather: 2012-2015"
)