<a href="https://colab.research.google.com/github/Doongka/GHDColabExamples/blob/master/Visualising_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization Example
The goal of this notebook is to provide a quick overview of some useful Python tools used for cleaning, analysing and visualizing data.

It has been created using Google Colab, a free cloud-based notebook environment that allows you to write and execute Python code without needing to set up your own local Python environment.

This notebook will allow you to run the code in segments as you go.  When you get to a code block, you can execute (run) the code by pressing CTRL-ENTER.  

Try it on the code block below:

In [0]:
# assign "Hello World!" to a new variable "greeting"
greeting = "Hello World!"

# run the "print" function with "greeting" as an argument. i.e. Print to screen whatever is stored in "greeting"
print(greeting)

Hello World!


If successful you will see the words "Hello World!" output below the code.

Something important to note is that a # symbol at the start of a line causes Python to ignore that particular line. This is useful for describing your code, like I've done above, or for stopping lines of code from running while debugging. A quick way to "comment" and "uncomment" your code is to press CTRL-/
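
Here is a small sketch of both uses of comments: describing code and temporarily switching a line off. The variable name `total` is made up for this illustration.

```python
# This whole line is a comment, so Python skips it.
total = 2 + 3  # A comment can also follow code on the same line.

# The next line is "commented out", so it does not run:
# total = total * 100

print(total)
```

If you uncomment the multiplication line and run the cell again, the printed value changes, which is a handy way to test a line without deleting it.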



## Prepare the workspace
Google Colab has many useful Python packages preinstalled, so there usually shouldn't be any need to install them yourself.  Packages are basically collections of useful functions.  In case you do need something that is not installed, you can run a "pip" installation as shown below.

You can see that I've "commented" the bit of code that installs the package "geopandas".  Uncomment this line and run the code block.

In [0]:
# We place a ! in front of the command to indicate that we want this to run in a console.
# !pip install geopandas
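
Before installing anything, it can help to check whether a package is already available. The sketch below uses only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python, so this should report True.
print(is_installed("json"))

# Swap in the package you care about, for example "geopandas".
print(is_installed("geopandas"))
```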

## Uploading your data to Colab
The first step in any data analysis is to prepare the data.  As we are running this in Google Colab, we need to upload any datasets.  There are several ways to do this but for this example I've used the most intuitive.  

After running the cell below, you will see a "Choose Files" button that will allow you to select the files you wish to upload.  This example has been designed to handle a single CSV file.


In [2]:
from google.colab import files

uploaded = files.upload()

Saving brisbanetemp.csv to brisbanetemp.csv


Running the next cell will give you an overview of the files you have just uploaded.

In [3]:
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "brisbanetemp.csv" with length 590942 bytes
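
The `uploaded` object behaves like a dictionary that maps each file name to the file's raw bytes, which is why `len(uploaded[fn])` reports a size in bytes. As a rough sketch, those bytes can also be read straight into Pandas without going through a file on disk. The dictionary `uploaded_example` and its contents below are made up so that the snippet runs outside Colab.

```python
import io

import pandas as pd

# Stand-in for the dictionary returned by files.upload() (made-up contents).
uploaded_example = {
    "example.csv": b"date,max_temp\n4/06/1949,14.0\n5/06/1949,18.2\n"
}

for name, content in uploaded_example.items():
    # io.BytesIO wraps the bytes so that read_csv can treat them like a file.
    frame = pd.read_csv(io.BytesIO(content))
    print(name, frame.shape)
```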


## Wrangling your data

Python has many pre-built packages that help with any extract, transform and load (ETL) operations required to get the data ready.

Pandas is a popular library that provides many useful methods for data manipulation. We will be using its "DataFrame" functionality, which will allow us to import data from a CSV file and perform several data wrangling and cleansing tasks.



In [0]:
# This command loads the Pandas library into our current notebook
import pandas as pd

## Loading your data
Before we can do anything with our data, we first need to load it into memory.  Luckily Pandas has functions that can read files and load them into a useful format.

In [5]:

data = pd.read_csv('brisbanetemp.csv')

data.head()

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC),site number,site name
0,,,,40842.0,BRISBANE AERO
1,4/06/1949,14.0,2.4,,
2,5/06/1949,18.2,10.0,,
3,6/06/1949,21.0,5.6,,
4,7/06/1949,20.5,7.1,,


## Cleaning up your data
You can see from the table above that row 0 is mostly empty: it holds only the site number and site name.  Pandas marks missing values as "NaN" ("Not a Number"), and in this case they were caused by empty cells in the first row of the data.  Let's remove this row using the following command.

In [27]:
cleansed_data = data.drop(data.index[0])
cleansed_data.head()

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC),site number,site name
1,4/06/1949,14.0,2.4,,
2,5/06/1949,18.2,10.0,,
3,6/06/1949,21.0,5.6,,
4,7/06/1949,20.5,7.1,,
5,8/06/1949,20.5,6.1,,
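
To see how empty CSV cells become NaN and what `drop` does to the row labels, here is a minimal sketch on made-up data. The CSV text and the shortened column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up CSV text whose first data row is blank except for the site name.
csv_text = """date,max_temp,site_name
,,BRISBANE AERO
4/06/1949,14.0,
5/06/1949,18.2,
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.isna().sum())  # number of missing values in each column

# drop() returns a new DataFrame; "sample" itself is left unchanged.
trimmed = sample.drop(sample.index[0])
print(list(trimmed.index))  # the remaining rows keep their labels: [1, 2]
```

If you would rather have the labels start from 0 again, `trimmed.reset_index(drop=True)` renumbers them.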


In [28]:
cleansed_data = cleansed_data.drop(['site number', 'site name'], axis=1)
cleansed_data.head()

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC)
1,4/06/1949,14.0,2.4
2,5/06/1949,18.2,10.0
3,6/06/1949,21.0,5.6
4,7/06/1949,20.5,7.1
5,8/06/1949,20.5,6.1


In [29]:
cleansed_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25564 entries, 1 to 25564
Data columns (total 3 columns):
date                          25564 non-null object
maximum temperature (degC)    25537 non-null float64
minimum temperature (degC)    25525 non-null float64
dtypes: float64(2), object(1)
memory usage: 798.9+ KB


In [30]:
missing_data = cleansed_data['maximum temperature (degC)'].isna()
missing_data = missing_data | cleansed_data['minimum temperature (degC)'].isna()
missing_data.sum()

63

In [31]:
cleansed_data.loc[missing_data,:]

Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC)
623,16/02/1951,,17.9
624,17/02/1951,,17.6
625,18/02/1951,,16.4
626,19/02/1951,,20.7
627,20/02/1951,,21.6
...,...,...,...
10288,3/08/1977,21.6,
10822,19/01/1979,,22.4
11223,24/02/1980,29.4,
14000,2/10/1987,23.2,
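
The mask above is built by combining two `isna()` checks with `|` (logical "or"). As a sketch on made-up data, the same mask can also be built in one step with `isna().any(axis=1)`, which flags any row that has at least one missing value. The column names here are shortened for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "max_temp": [14.0, None, 21.0, 20.5],
    "min_temp": [2.4, 10.0, None, 7.1],
})

# Approach used in this notebook: combine one check per column.
mask_or = sample["max_temp"].isna() | sample["min_temp"].isna()

# Equivalent shortcut: True where any column in the row is missing.
mask_any = sample.isna().any(axis=1)

print(mask_any.sum())        # number of rows with a missing value
print(sample.loc[mask_any])  # the rows themselves
```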


In [37]:
# Alternatives: fill each gap from the next (bfill) or previous (ffill) valid value.
# final_data = cleansed_data.bfill()
# final_data = cleansed_data.ffill()
final_data = cleansed_data.interpolate(method='linear')
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25564 entries, 1 to 25564
Data columns (total 3 columns):
date                          25564 non-null object
maximum temperature (degC)    25564 non-null float64
minimum temperature (degC)    25564 non-null float64
dtypes: float64(2), object(1)
memory usage: 798.9+ KB
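
The three gap-filling options differ in what they put into the missing cells. A small sketch on a made-up series shows the difference; the values are invented for illustration.

```python
import pandas as pd

temps = pd.Series([10.0, None, None, 16.0])

forward = temps.ffill()    # repeat the last valid value: 10, 10, 10, 16
backward = temps.bfill()   # copy the next valid value back: 10, 16, 16, 16
linear = temps.interpolate(method="linear")  # evenly spaced steps between 10 and 16

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

For daily temperatures with short gaps, linear interpolation is a reasonable choice because it avoids the flat steps that forward or backward filling would introduce.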


In [40]:
final_data['date'] = pd.to_datetime(final_data['date'], format="%d/%m/%Y")
final_data.info()
final_data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25564 entries, 1 to 25564
Data columns (total 3 columns):
date                          25564 non-null datetime64[ns]
maximum temperature (degC)    25564 non-null float64
minimum temperature (degC)    25564 non-null float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 798.9 KB


Unnamed: 0,date,maximum temperature (degC),minimum temperature (degC)
1,1949-06-04,14.0,2.4
2,1949-06-05,18.2,10.0
3,1949-06-06,21.0,5.6
4,1949-06-07,20.5,7.1
5,1949-06-08,20.5,6.1
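
The dates in this file are written day first (for example 4/06/1949 is 4 June), so the `format` string tells Pandas exactly how to read them. Here is a minimal sketch on made-up strings, showing that once the column is a real datetime, the day, month and year can be pulled out with the `.dt` accessor.

```python
import pandas as pd

raw_dates = pd.Series(["4/06/1949", "5/06/1949", "12/01/1950"])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed.dt.day.tolist())    # [4, 5, 12]
print(parsed.dt.month.tolist())  # [6, 6, 1]
print(parsed.dt.year.tolist())   # [1949, 1949, 1950]
```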


## Visualizing your Data


Altair is a great data visualization library that is preinstalled in Colab.  It is a statistical visualization library with a wide range of graph and data visualization types available.


In [0]:
import altair as alt

# Import the sample datasets under a different name so they do not overwrite
# the "data" DataFrame loaded earlier in this notebook.
from vega_datasets import data as vega_data

source = vega_data.seattle_weather()

# Map each weather type to a fixed colour.
scale = alt.Scale(domain=['sun', 'fog', 'drizzle', 'rain', 'snow'],
                  range=['#e7ba52', '#a7a7a7', '#aec7e8', '#1f77b4', '#9467bd'])
color = alt.Color('weather:N', scale=scale)

# We create two selections:
# - a brush that is active on the top panel
# - a multi-click that is active on the bottom panel
# Note: Altair 5 and later deprecate selection_multi and add_selection
# in favour of selection_point and add_params.
brush = alt.selection_interval(encodings=['x'])
click = alt.selection_multi(encodings=['color'])

# Top panel is scatter plot of temperature vs time
points = alt.Chart().mark_point().encode(
    alt.X('monthdate(date):T', title='Date'),
    alt.Y('temp_max:Q',
        title='Maximum Daily Temperature (C)',
        scale=alt.Scale(domain=[-5, 40])
    ),
    color=alt.condition(brush, color, alt.value('lightgray')),
    size=alt.Size('precipitation:Q', scale=alt.Scale(range=[5, 200]))
).properties(
    width=550,
    height=300
).add_selection(
    brush
).transform_filter(
    click
)


# Bottom panel is a bar chart of weather type
bars = alt.Chart().mark_bar().encode(
    x='count()',
    y='weather:N',
    color=alt.condition(click, color, alt.value('lightgray')),
).transform_filter(
    brush
).properties(
    width=550,
).add_selection(
    click
)

alt.vconcat(
    points,
    bars,
    data=source,
    title="Seattle Weather: 2012-2015"
)