![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Visualization Design

In this notebook we will work with an open dataset from Statistics Canada exploring the engagement of Canadians in outdoor activities. 

[Citation] Statistics Canada. Table 38-10-0121-01 Participation in outdoor activities. DOI: https://doi.org/10.25318/3810012101-eng

[Contains information licensed under the Open Government Licence – Canada](https://open.canada.ca/en/open-government-licence-canada)


We will design the visualizations in this notebook according to the following guidelines:


#### 1. Define a Clear Purpose

We are interested in answering the questions:

1. What outdoor activities do Canadians engage in? 

2. Are there activities more popular than others across Canadian cities? 

3. What percentage of Canadians engage in each of the different outdoor activities in 2011, 2013, 2015 and 2017? 

#### 2. Know the Audience

We will assume our audience is fluent in English, and has an understanding of what a province is, as well as what the different provinces in Canada are. 

#### 3. Use Visual Features to Show the Data Properly

We will use bar graphs to show the average percentage of people engaging in each activity across all selected years, and line graphs to show the top ten cities for a given activity. 


#### 4. Keep It Organized and Coherent

Questions to ask yourself: Does your visualization look cluttered? Are you mixing data categories? Are you attempting to answer more than one question using a single visualization? 


#### 5. Make Data Visualization Inclusive
Questions to ask yourself: What are colour-blind-friendly colour palettes? Are you using them in your visualizations? If not, how can you incorporate them? Is the size of the visualization elements appropriate? What language does your audience use to communicate? Are there potential biases in your visualization that can be addressed? 

__________

Run the cell below to import modules. 

In [None]:
# sys is used by the download cell below (sys.exit)
import sys

import plotly.express as px

import cufflinks as cf
cf.go_offline()

Then run the cell below to load the helper functions needed to get data from the Statistics Canada API. 

In [None]:
%run -i ./StatsCan/helpers.py
%run -i ./StatsCan/scwds.py
%run -i ./StatsCan/sc.py
%run -i ./widgets_libraries.py

The product ID associated with the dataset we are interested in is `38-10-0121-01`. Run the cell below to get the data. 

In [None]:
# Download data
productId = "38-10-0121-01"

# Basic sanity check on the product ID format before downloading:
# either 10 characters with no hyphens, or four hyphen-separated parts
if "-" not in productId:
    if len(productId) != 10:
        print("WARNING: THIS IS LIKELY A NUMBER NOT ASSOCIATED WITH A DATA TABLE. VERIFY AND TRY AGAIN")
        sys.exit(1)
else:
    if len(productId.split("-")) != 4:
        print("WARNING: THIS IS LIKELY A NUMBER NOT ASSOCIATED WITH A DATA TABLE. VERIFY AND TRY AGAIN")
        sys.exit(1)

download_tables(str(productId))
df_fullDATA = read_data_compute_df(productId)

cols = list(df_fullDATA.loc[:,'REF_DATE':'UOM'])+ ['SCALAR_FACTOR'] +  ['VALUE']
df_less = df_fullDATA[cols]
df_less2 = df_less.drop(["DGUID"], axis=1)

df_less2.head()
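
The format check in the download cell above can also be written as a small reusable function. This is only a sketch: the name `looks_like_product_id` is hypothetical, and the extra digits-only test is an assumption that the original check does not make.

```python
def looks_like_product_id(product_id):
    """Rough format check for a Statistics Canada product ID.

    Accepts either four hyphen-separated parts (e.g. "38-10-0121-01")
    or a 10-character string without hyphens (e.g. "3810012101").
    The digits-only requirement is an added assumption.
    """
    text = str(product_id)
    if "-" in text:
        parts = text.split("-")
        return len(parts) == 4 and all(part.isdigit() for part in parts)
    return len(text) == 10 and text.isdigit()


print(looks_like_product_id("38-10-0121-01"))
print(looks_like_product_id("38-10-0121"))
```

A check like this only catches obvious typing mistakes; it does not confirm that a table with that ID actually exists.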

### Question 1: What are the top ten cities engaging in a given activity?

Let's say I am curious to know the average percentage of Canadians who engage in the following activities:

1. Walking
2. Bicycling
3. Jogging, running, rollerblading, cross-country running
4. Hiking
5. Snowmobiling

Let's explore that using the code below. 

Note: we obtained the categories above using the following command `df_less2["Participation in outdoor activities"].unique()`

Run the cell below. It plots the ten locations with the highest average percentage of people who engage in walking.

In [None]:
activity='Walking'

# Average VALUE per location for the chosen activity, highest first, top ten kept
df_less2[df_less2['Participation in outdoor activities'] == activity]\
    .groupby("GEO")[["VALUE"]].mean()\
    .sort_values(by="VALUE", ascending=False)\
    .iloc[0:10].iplot(kind='scatter',
                  yTitle="Average percentage",
                  xTitle="Location",
                  # Change color here
                  color='red',
                  # Change width here
                  width=4,
                  title='Average percentage of people who engage in ' + str(activity).lower() + " (top ten places)")



#### Activity

In the code cell above, change the content in the variable `activity` for one or more of the following values:

`"Bicycling"`

`"Jogging, running, rollerblading, cross-country running"`

`"Hiking"`

`"Snowmobiling"`

What are the top ten cities engaging in each?
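
To see how a "top places" ranking like the one in Question 1 is built, here is a minimal sketch on a small made-up table. The city names and numbers are invented for illustration and are not Statistics Canada values; the helper name `top_places` is hypothetical. Only the column names match the dataset.

```python
import pandas as pd

# Made-up numbers for illustration only; not Statistics Canada values.
toy = pd.DataFrame({
    "GEO": ["City A", "City A", "City B", "City B", "City C", "City C"],
    "Participation in outdoor activities": ["Walking"] * 6,
    "VALUE": [70.0, 74.0, 60.0, 62.0, 80.0, 78.0],
})


def top_places(df, activity, n=10):
    """Return the n places with the highest mean VALUE for one activity."""
    is_activity = df["Participation in outdoor activities"] == activity
    # Select VALUE before averaging so text columns are never averaged
    means = df[is_activity].groupby("GEO")[["VALUE"]].mean()
    # Highest average first, then keep the first n rows
    return means.sort_values(by="VALUE", ascending=False).head(n)


top_two = top_places(toy, "Walking", n=2)
print(top_two)
```

Sorting in descending order before taking the first rows is what makes this a "top" list; sorting in ascending order would return the places with the lowest averages instead.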

____

### Question 2: What outdoor activities do Canadians engage in?

Given the dataset contains information on cities within each province, let's narrow down our search by specifying a province of interest. 

Run the cell below to see the average percentage of people who engage in a given activity in Ontario. 


In [None]:
location='Ontario'

# Keep rows whose GEO contains the location, then average VALUE per activity
new_table = df_less2[df_less2['GEO'].str.contains(location)]\
    .groupby('Participation in outdoor activities')[["VALUE"]].mean()\
    .sort_values(by="VALUE")

new_table.iplot(kind='bar',
                yTitle="Average percentage",
                xTitle="Activity",
                # Change color here
                color='red',
                title='Average percentage of people who engage in outdoor activities in ' + str(location))



#### Activity

In the code cell above, change the content in the variable `location` for one or more of the following values:

`"British Columbia"`

`"Alberta"`

`"Saskatchewan"`

`"Winnipeg, Manitoba"`

What are the top ten activities people engage in, for each of the locations? Are there activities no person was recorded engaging in?
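
The location filter in Question 2 matches on part of the `GEO` text, so a province name also picks up any city whose label contains it. The sketch below illustrates this on a made-up table; the rows and numbers are invented, and the helper name `activity_means` is hypothetical.

```python
import pandas as pd

# Made-up numbers for illustration only; None stands for a missing value.
toy = pd.DataFrame({
    "GEO": ["Ontario", "Toronto, Ontario", "Alberta",
            "Ontario", "Toronto, Ontario"],
    "Participation in outdoor activities": ["Walking", "Walking", "Walking",
                                            "Hiking", "Hiking"],
    "VALUE": [70.0, 60.0, 75.0, 20.0, None],
})


def activity_means(df, location):
    """Average VALUE per activity over every GEO containing `location`."""
    # regex=False treats the location as plain text, not a regular expression
    in_location = df["GEO"].str.contains(location, regex=False)
    return (df[in_location]
            .groupby("Participation in outdoor activities")[["VALUE"]]
            .mean()  # missing values are skipped when averaging
            .sort_values(by="VALUE"))


ontario = activity_means(toy, "Ontario")
print(ontario)
```

Here the "Ontario" average for walking mixes the province row with the Toronto row, which is worth keeping in mind when reading the bar chart for a province.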

____

### Question 3: What percentage of Canadians engage in each of the different outdoor activities in 2011, 2013, 2015 and 2017?

The code above allows us to visualize average percentages. However, the dataset contains information from four years: 2011, 2013, 2015 and 2017. 

Let's pivot our table to take a look at this. 

Run the cell below to pivot our table. We will also drop any rows that contain missing (NaN) values. 


In [None]:
# Pivoting table
pivot_by_changes = df_less2.pivot_table(values="VALUE",
                     index=["GEO","Participation in outdoor activities"],
                     columns='REF_DATE')
# Drop NaN values
pivot_by_changes = pivot_by_changes.dropna()

# Display table
pivot_by_changes
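
To make the pivot step concrete, here is a minimal sketch on a made-up table with two years instead of four. The values are invented, and `REF_DATE` is assumed to hold plain year numbers; in the real table it may be stored differently.

```python
import pandas as pd

# Made-up numbers for illustration only; City B has no 2013 entry.
toy = pd.DataFrame({
    "REF_DATE": [2011, 2013, 2011, 2013, 2011],
    "GEO": ["City A", "City A", "City A", "City A", "City B"],
    "Participation in outdoor activities": ["Walking", "Walking",
                                            "Hiking", "Hiking", "Walking"],
    "VALUE": [70.0, 72.0, 20.0, 25.0, 65.0],
})

# One row per (location, activity), one column per year
wide = toy.pivot_table(values="VALUE",
                       index=["GEO", "Participation in outdoor activities"],
                       columns="REF_DATE")

# Drop rows where at least one year is missing
complete = wide.dropna()

# Select all activities for one location from the two-level index
city_a = complete.xs("City A")
print(city_a)
```

Because `dropna()` removes a row as soon as one year is missing, a location and activity pair with incomplete data disappears from the table entirely rather than showing a gap.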

Let's take a look at how the percentage of people engaging in different outdoor activities has changed for each year. 

Run the cell below to see these values for Winnipeg, Manitoba.

In [None]:
province = 'Winnipeg, Manitoba'
my_colors = ['red','blue','orange','green']
pivot_by_changes.xs(province).iplot(kind='bar',
                                     xTitle='Activity',
                                    yTitle='Percentage',
                                    color=my_colors,
                                    title='Percentage of people who engaged in outdoor activities, per year, in '+str(province))

In the code cell above, change the content in the variable `province` for one or more of the following values:

`"British Columbia"`

`"Alberta"`

`"Saskatchewan"`

`"Winnipeg, Manitoba"`

What percentage of people engaged in outdoor activities? How has this percentage changed over time? 
____

### Making our visualizations inclusive

As a final exercise, visit https://colorbrewer2.org/ and change the colours within the plots to make them colour-blind friendly. 
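
As a starting point for this exercise, the sketch below builds a colour list from hex codes. The four codes are, to the best of my knowledge, the ColorBrewer 4-class "Paired" scheme, which the site flags as colour-blind safe; please verify them on colorbrewer2.org before relying on them. The helper name `colours_for` is hypothetical.

```python
import re
from itertools import cycle, islice

# Assumed to be ColorBrewer's 4-class "Paired" scheme; verify on the site.
PAIRED_4 = ["#a6cee3", "#1f78b4", "#b2df8a", "#33a02c"]


def colours_for(n_series, palette=PAIRED_4):
    """Return n_series hex colours, repeating the palette if it is too short."""
    for colour in palette:
        # Each entry must look like "#rrggbb"
        if not re.fullmatch(r"#[0-9a-fA-F]{6}", colour):
            raise ValueError("Not a hex colour: " + repr(colour))
    return list(islice(cycle(palette), n_series))


my_colors = colours_for(4)
print(my_colors)
```

A list like `my_colors` can replace the named colours in the per-year bar chart, where `color=my_colors` is passed to `iplot`. Repeating colours when there are more series than palette entries is a design shortcut; picking a larger scheme on the site is usually the better choice.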



[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)