<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Principles of Data Visualisation With Python

---

#### Navigation

[Part 1: Data Visualisation for Communication](#part1)

[Part 2: Data Visualisation for Exploration](#part2)

## Icebreaker

With your neighbour, discuss:

- what's the best piece of data visualisation you've seen and why?

If you can't think of anything off the top of your head, try:

- where (website, newspaper etc.) have you seen the best data visualisations?

# Housekeeping

- Unit 2 assessment: out now, and due approx. Tuesday 5th June

- Coming up: stats, "mini group project", then machine learning!

- Project lightning talks in **2 weeks**

# Final Projects

#### Purpose

- for you to showcase what you can do!

- to practise end-to-end data analysis on "real" data that you find yourself

- the purpose of your work should be to **inform some sort of decision** and **make a recommendation**

- **don't** just predict something because you like machine learning!

#### Structure

- Proposal

- Brief

- Report & Presentation

**Proposal**: Describe your chosen problem and identify relevant data sets

- find 2-3 problems of interest (to you or your company)

- think of what data you need and what sort of hypotheses you'd want to evaluate, or what things you'd predict

- deliverable: 3-4 minute lightning talk

- due Tuesday 5th June

**Brief**: Share a summary of your initial analysis and your next steps with your instructional team.

- obtain your dataset and perform exploratory data analysis on it

- think about how your approach needs to change based on your findings

- deliverable: short progress report ("1 page" is fine!)

- due end of Unit 3, so approx. 21st June

**Report**: Submit a cleanly formatted Jupyter notebook (or other files) documenting your code and process for technical/peer stakeholders.

**Presentation**: Present a summary of your business problem, approach, and recommendation to an audience of non-technical executive stakeholders.

- more info on these deliverables later

- final presentations (~10 minutes each) in the final week: Tuesday 10th & Thursday 12th July

### Example: "Hot dog or not hot dog?" classifier

**Lightning talk**: I want to classify hot dog vs. not hot dog by building a model that recognises hot dogs in Instagram posts.

Hot dog vendors could then use this to see hot dog-related activity on Instagram.

For this I'll need pictures of hot dogs from Instagram.

My other project idea is "bagel or not bagel?". For that I'll need pictures of bagels.

**Brief**: based on feedback from my teaching team, it turns out this is a hard problem.

However, I can use the hashtag #hotdog and extract metadata from those images to get time & location.

My project is now not an image classifier but an analysis of when and where people take pictures of hot dogs.

**Report & Presentation**: based on my analysis, it seems like people buy hot dogs and post them on Instagram on weekends more than weekdays.

Based on this, and the analysis of geo-tag data, I would recommend hot dog vendors target large parks on Saturdays.

To get you thinking, have a look at the student gallery: [https://gallery.generalassemb.ly/DS](https://gallery.generalassemb.ly/DS)

Some other examples:

- cruise ship recommendation engine
- "how do I improve my fantasy football team?"
- customer churn model on the company's real data

# Why do we bother visualising data?

- human brains are wired to process more information visually

- a universal way to convey information

- attractive to look at

- often the best way to describe/analyse data **as the amount of data increases**

### Purposes of Data Visualisation

#### Communication

#### Exploration

### Learning Objectives

#### Part 1

- Describe why data visualisation is important.
- Identify the characteristics of a great data visualisation.

<a id="part1"> </a>

# Data Visualisation for Communication

### Bad Data Visualisation

<table style="border-width:0px">
    <tr style="border-width:0px"><td style="border-width:0px"><img src="assets/images/bad_datavis_1.png" /></td><td style="border-width:0px"><img src="assets/images/bad_datavis_2.png" /></td></tr>
    <tr style="border-width:0px"><td style="border-width:0px"><img src="assets/images/bad_datavis_3.png" /></td><td style="border-width:0px"><img src="assets/images/bad_datavis_4.png" /></td></tr>
</table>

![](assets/images/worst_piechart.jpg)

<img src="assets/images/hotdogs.jpg" style="width:60%" />

# Good DataVis?

![](assets/images/economist-map.JPG)

For good examples of DataVis you can try:

- [Information is Beautiful](https://www.informationisbeautifulawards.com)
- [FlowingData](http://flowingdata.com)

## What are some visual attributes that we can use to visualise data?

That is, how can we visually convey that two things in our data are different?

Let's take a look at what Jeffrey Shaffer, who teaches data visualisation at the University of Cincinnati, thinks:

![](assets/images/data%20attributes.png)

Which ones do you think are easier/harder for humans to perceive?

## Stevens' Power Law

<img src="assets/images/stevens-power-law.PNG" style="width:70%; border-width: 0px" />
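
Stevens' power law models perceived magnitude as a power of the physical stimulus: perceived = k × stimulus^a. The sketch below is only an illustration. The exponents are commonly cited approximations that we are assuming here (published values vary between experiments), and `perceived_ratio` is a hypothetical helper, not part of any library.

```python
# Stevens' power law: perceived = k * stimulus ** exponent.
# The exponents below are commonly cited approximations, assumed here
# for illustration; published values vary between experiments.
ASSUMED_EXPONENTS = {"length": 1.0, "area": 0.7}


def perceived_ratio(stimulus_ratio, exponent):
    """How much bigger a stimulus looks when it is `stimulus_ratio` times bigger.

    The constant k cancels out when comparing two stimuli of the same kind.
    """
    return stimulus_ratio ** exponent


for attribute, exponent in ASSUMED_EXPONENTS.items():
    ratio = perceived_ratio(2.0, exponent)
    print(f"Doubling {attribute}: looks about {ratio:.2f}x bigger")
```

Under these assumed exponents, a bar that is twice as long reads as about twice as big, while a shape with twice the area reads as only about 1.6 times bigger. That is one argument for encoding values as lengths rather than areas.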

# Colour

Generally, in data visualisations, you’re going to use colour in one of **three** ways.

## Sequential

Sequential colours are used to show values ordered from low to high.

<img src="assets/images/sequential.png" style="width:55%" />

Which of the types of data (nominal, ordinal, interval, ratio) would this be suitable for?
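
As a rough sketch of a sequential scale in `matplotlib`: the data below is made up, and `"Blues"` is just one of Matplotlib's built-in sequential colormaps, picked for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: a 5x5 grid of values increasing from 0 to 24.
values = np.arange(25).reshape(5, 5)

fig, ax = plt.subplots()
# "Blues" is a built-in sequential colormap: light for low values, dark for high.
image = ax.imshow(values, cmap="Blues")
fig.colorbar(image, ax=ax, label="value (low to high)")
ax.set_title("Sequential colour scale")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```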

## Divergent

Divergent colours are used to show ordered values that have a critical midpoint, like an average or zero.

<img src="assets/images/divergent.png" style="width: 45%" />

Which of the types of data (nominal, ordinal, interval, ratio) would this be suitable for?
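
A possible `matplotlib` sketch of a divergent scale, assuming made-up data whose critical midpoint is zero. The important choice is giving the colour scale the same limit on both sides, so that zero lands on the neutral middle colour; `"RdBu"` is one of Matplotlib's built-in diverging colormaps.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: differences from an average, so zero is the critical midpoint.
differences = np.linspace(-3, 5, 25).reshape(5, 5)

# Use the same limit on both sides so that zero sits at the neutral middle colour.
limit = np.abs(differences).max()

fig, ax = plt.subplots()
# "RdBu" is a built-in diverging colormap.
image = ax.imshow(differences, cmap="RdBu", vmin=-limit, vmax=limit)
fig.colorbar(image, ax=ax, label="difference from average")
ax.set_title("Divergent colour scale centred on zero")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```

Without the symmetric `vmin`/`vmax`, the middle colour would sit at the midpoint of the data range (here 1) rather than at zero.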

## Categorical

Categorical colours are used to distinguish data that falls into distinct groups.

<img src="assets/images/categorical.png" style="width: 50%" />

Which of the types of data (nominal, ordinal, interval, ratio) would this be suitable for?
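
A minimal sketch of categorical colours in `matplotlib`, with made-up groups and counts. `"tab10"` is one of Matplotlib's built-in qualitative colormaps: ten distinct colours with no implied order.

```python
import matplotlib.pyplot as plt

# Hypothetical data: counts for three unordered groups.
groups = ["cats", "dogs", "fish"]
counts = [12, 9, 4]

# Take one distinct colour per group from the built-in "tab10" palette.
palette = plt.get_cmap("tab10").colors

fig, ax = plt.subplots()
bars = ax.bar(groups, counts, color=list(palette[:len(groups)]))
ax.set_ylabel("count")
ax.set_title("Categorical colours")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```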

<a id="part2"> </a>

# Data Visualisation for Exploration

### Learning Objectives
- Describe when you would use a bar chart, pie chart, line chart, and scatter plot.
- Practise creating plots of your data using `matplotlib`.

<a id='anscombe'></a>

Below are the summary statistics for four datasets. What do you think a plot of each dataset would look like?

![summary statistics for four different plots](assets/images/anscombe_dataset.png)

### Anscombe's Quartet

<img src="assets/images/anscombe.png" style="width:70%" />
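
To check the summary statistics yourself, here is a small sketch using only the standard library. The values are the quartet as published by Anscombe (1973), typed in by hand here, and `pearson_r` is a hypothetical helper written for this example.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973), typed in by hand for this sketch.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return covariation / (spread_x * spread_y)


summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }
    print(name, {key: round(value, 2) for key, value in summary[name].items()})
```

All four datasets print near-identical numbers, even though their plots look nothing alike. That is the case for plotting your data before trusting its summary statistics.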

<a id='chart_choice'></a>

# Choosing the Right Chart

### Bar Charts

Bar charts make it easy to compare information, revealing highs and lows quickly.

Bar charts are most effective when you have numerical data that splits neatly into different categories.

![](./assets/images/bar%20chart.png)
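
A minimal `matplotlib` sketch of a bar chart: one number per category. The flavours and sales figures are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data: one number per category.
flavours = ["vanilla", "chocolate", "strawberry", "mint"]
scoops_sold = [48, 61, 27, 15]

fig, ax = plt.subplots()
bars = ax.bar(flavours, scoops_sold)
ax.set_xlabel("flavour")
ax.set_ylabel("scoops sold")
ax.set_title("Scoops sold by flavour")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```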

### Pie Charts

Pie charts are only useful to show relative proportions or percentages of information, but are both **overused** and **misused**.

Once a pie has more than 2-3 slices, the slices become hard to compare.

Best to avoid them entirely.

### The Best Use of a Pie Chart

![](http://i.imgur.com/uhTf6Ek.jpg)

### Scatter Plots

Scatter plots are a great way to give you a sense of trends, concentrations, and outliers.

![](./assets/images/scatter%20plot.png)
[Scatter plot via Wikibooks](https://en.wikibooks.org/wiki/Statistics/Displaying_Data/Scatter_Graphs)
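
A minimal `matplotlib` sketch of a scatter plot. The data is randomly generated: a noisy linear trend plus one outlier added by hand, so you can see how both show up.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: a noisy linear trend plus one deliberate outlier.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(0, 2, size=50)
x = np.append(x, 2.0)
y = np.append(y, 25.0)

fig, ax = plt.subplots()
points = ax.scatter(x, y, alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Trend with one outlier")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```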

### Line charts

Line charts are used when there's a temporal element to your data.

![](assets/images/xkcd-chart.png)

[xkcd #418](https://xkcd.com/418)
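
A minimal `matplotlib` sketch of a line chart, using made-up monthly figures. Time goes on the x-axis and the line connects consecutive measurements.

```python
import matplotlib.pyplot as plt

# Hypothetical data: one measurement per month.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
visitors = [120, 135, 160, 158, 190, 230]

fig, ax = plt.subplots()
(line,) = ax.plot(months, visitors, marker="o")
ax.set_xlabel("month")
ax.set_ylabel("visitors")
ax.set_title("Visitors per month")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```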

There is no better or worse chart (except pies, they're the worst).

You should consider which one is most appropriate for representing a particular data set.

[Which chart is right for you? (via Tableau)](https://drive.google.com/file/d/0Bx2SHQGVqWasT1l4NWtLclJJcWM/view)

## Visualisation Programming Libraries

In this course, we will mostly use the Python library [Matplotlib](https://matplotlib.org/), but we will also see examples using [Seaborn](https://seaborn.pydata.org/), which has some additional plots and options.

Many other Python libraries exist for making visualisations. Some of the most popular include:

- **[Bokeh](http://bokeh.pydata.org/en/latest/):** Python visualisation library that targets the web browser (e.g., in Jupyter). Makes interactive plots, dashboards, data applications, etc.

- **[Graphviz](http://graphviz.readthedocs.io/en/stable/manual.html):** Popular visualisation library for graph data structures (vertices and edges). Has Python extensions.

- **[Basemap](http://matplotlib.org/basemap/):** Python Matplotlib extension for drawing static maps. There are many other Python libraries for plotting geographic data, including [folium](https://github.com/python-visualization/folium).

One of the most popular libraries for interactive visualisations in the web browser is D3. Because web browsers only natively run JavaScript, D3 requires knowledge of JavaScript:

- **[D3.js](https://d3js.org/):** JavaScript library for interactive web visualisations | [Examples](https://github.com/mbostock/d3/wiki/Gallery)

### Other Visualisation Tools

Although this course emphasises a Python approach to data science, a variety of non-programming tools are also used in industry. Often, these tools can be applied much more quickly than creating a custom Python solution. For example:

- **Excel:** For quick data cleaning and simple graphs
- **Power BI:** A suite of business analytics tools
- **Tableau:** Business intelligence and analytics software
- **Periscope Data:** Data analysis platform
- **Plotly:** Create charts and dashboards
