# Visualization notebook
by N.G. 

## Introduction

This notebook visualizes data using two different graphs: a scatter plot and a time-series histogram. Visualizing data is a simple but powerful way of exploring information. The data collection we are working with is vast, and graphs offer a quick and convenient way to analyze it and identify patterns. Those patterns can in turn point further research in the right direction.


## Data

These graphs are based on the Ingenium open data collection, a catalogue of all artifacts from the Canada Agriculture and Food Museum, the Canada Aviation and Space Museum, and the Canada Science and Technology Museum. It is available in CSV format, which made it convenient to use in a Jupyter notebook. The collection has a few limitations. Certain values are missing, which limits the topics that can be explored. The descriptions of the artifacts are not uniform; a user would need to make them more consistent before using them in a visualization. Otherwise, this is a dataset with numerous and diverse columns to explore.

## Preparing our data 

Let's set up the notebook. Our first objective is to install Bokeh. This is the Python library that allows us to create interactive visualizations. 

In [22]:
!pip install bokeh pyproj



Let's load the data now! The .csv file name goes inside the parentheses of pd.read_csv(); this tells pandas which file we want to pull data from.

The notebook will also print an abridged view of the dataset and its columns.

In [2]:
import pandas as pd

#below, we are setting the Ingenium data collection as our dataframe 

df = pd.read_csv('cstmc-CSV-en.csv')

# this produces an abridged version of our data
# definitely browse it and get a feel for the data we will be working with

print(df)



       artifactNumber              ObjectName      GeneralDescription  \
0       1966.0001.001                   Cover                   PAPER   
1       1966.0002.001          Stamp  postage                   PAPER   
2       1966.0003.001          Stamp  postage                   PAPER   
3       1966.0004.001          Stamp  postage                   PAPER   
4       1966.0005.001          Stamp  postage                   PAPER   
...               ...                     ...                     ...   
108458  2017.0005.002                Joystick     Synthetic and metal   
108459  2017.0005.003            Power supply     Synthetic and metal   
108460  2017.0005.004      Cord  power supply     Synthetic and metal   
108461  2017.0005.005  Case  storage-carrying     Synthetic and metal   
108462  2017.0006.001             Salinometer  Synthetic  metal  wood   

                               model SerialNumber Manufacturer ManuCountry  \
0        WESTERN CANADA AIRWAYS LTD.         

There is another way to present the dataset. Instead of printing a portion of actual data, we can create a more concise list of columns in the dataset. This allows the user to further organize and grasp the material. This list feature is especially useful for selecting variables to visualize.

In [3]:
#this lists the columns of data available to us. 

df.columns.tolist()

['artifactNumber',
 'ObjectName',
 'GeneralDescription',
 'model',
 'SerialNumber',
 'Manufacturer',
 'ManuCountry',
 'ManuProvince',
 'ManuCity',
 'BeginDate',
 'EndDate',
 'date_qualifier',
 'patent',
 'NumberOfComponents',
 'ArtifactFinish',
 'ContextCanada',
 'ContextFunction',
 'ContextTechnical',
 'group1',
 'category1',
 'subcategory1',
 'group2',
 'category2',
 'subcategory2',
 'group3',
 'category3',
 'subcategory3',
 'material',
 'Length',
 'Width',
 'Height',
 'Thickness',
 'Weight',
 'Diameter',
 'image',
 'thumbnail',
 'Unnamed: 36']
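
The Data section mentioned that missing values limit which topics can be explored. Here is a minimal sketch of one way to check this for each column. It assumes a small made-up dataframe named toy that stands in for df; the rows are invented for illustration and are not taken from the Ingenium collection.

```python
import pandas as pd

# 'toy' is a hypothetical stand-in for df; the values are invented
toy = pd.DataFrame({
    "ObjectName": ["Propeller", "Telephone model", None],
    "ManuCity": ["Ottawa", None, None],
    "Length": [12.0, None, 3.5],
})

# isna() marks every empty cell as True; mean() turns that into a share per column
missing_share = toy.isna().mean().sort_values(ascending=False)

# columns with the largest share of missing values are listed first
print(missing_share)
```

Running the same two lines on df instead of toy should give a quick sense of which columns are complete enough to visualize.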

This collection of data is vast. We need to trim it down into a smaller dataframe that is workable and relevant to us. This code block filters the dataset down to artifacts manufactured in Ottawa or Kingston. This refined Ottawa/Kingston dataframe will be used in the next section.

In [4]:
# this collection of data is massive though. 
# we want to narrow the dataframe to the Ottawa/Kingston area. 
    
options = ['Ottawa', 'Kingston']  
    
ottking_df = df[df['ManuCity'].isin(options)]

print('\nResult dataframe :\n',
  ottking_df)


Result dataframe :
        artifactNumber               ObjectName  \
30      1966.0043.001  Therapy machine  cobalt   
31      1966.0043.002             Control unit   
70      1966.0065.001  Plate  equipment number   
87      1966.0083.001                Propeller   
88      1966.0084.001                Propeller   
...               ...                      ...   
108426  2016.0201.001          Telephone model   
108427  2016.0202.001          Telephone model   
108428  2016.0202.002              Casing part   
108431  2016.0202.005           Board  circuit   
108432  2016.0202.006           Board  circuit   

                                       GeneralDescription  \
30                    METAL  SYNTHETIC & WOOD COMPONENTS.   
31                    METAL  SYNTHETIC & WOOD COMPONENTS.   
70                                                  BRASS   
87                                                   WOOD   
88                                                   WOOD   
...         
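
To see what the filter is doing behind the scenes, here is a minimal sketch on a tiny made-up dataframe. The name toy and its rows are assumptions for illustration only. The isin() call produces a True/False value for each row, and only the True rows are kept.

```python
import pandas as pd

# 'toy' is a hypothetical stand-in for df; the rows are invented
toy = pd.DataFrame({
    "ObjectName": ["Propeller", "Control unit", "Telephone model", "Cover"],
    "ManuCity": ["Ottawa", "Kingston", "Toronto", None],
})

options = ["Ottawa", "Kingston"]

# one True/False value per row; rows with a missing city come out as False
mask = toy["ManuCity"].isin(options)
print(mask)

# keep only the rows where the mask is True
subset = toy[mask]

# a quick check of how many rows each city contributes
print(subset["ManuCity"].value_counts())
```

Note that isin() matches exact text, so a city spelled differently (for example with extra spaces or different capitalization) would not be picked up by this filter.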

## Visualizing numerical data

Our goal for this section is to create a scatter plot that represents numerical data. I have used length and width measurements to demonstrate, but any numerical columns can be used. The first action is to specify that we want to use data that is relevant to Ottawa and Kingston. We do this by drawing a random sample of rows from the dataframe that was created in the previous block.

The next action is to plug in variables for the x-axis and y-axis. 

The next few lines of code are aesthetic, but they are essential to the organization and presentation of the data and shouldn't be overlooked.

The final step in creating this graph is to add the HoverTool. This is useful for displaying further information about the data. When you hover your mouse over a specific point in the scatter plot, the HoverTool tells you what the object is and where it was manufactured. This turns the graph into an interactive resource where users can explore the dataset point by point.



In [21]:
#here we are importing the functions we need 
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
output_notebook()

# the '1000' is the number of rows that we want the sample to include
# note: sample() raises an error if the dataframe has fewer rows than this
sample = ottking_df.sample(1000)
source = ColumnDataSource(sample)

#this is what dictates the graph
p = figure()
# scatter() draws circles by default; recent Bokeh releases deprecate circle() with a size
p.scatter(x='Length', y='Width', 
          source=source,
          size=4, color='green') # experiment and change the size and colour
 
#we want to clearly communicate what the graph means 
p.title.text = 'Dimensions of artifacts produced in Ottawa and Kingston'
p.xaxis.axis_label = 'Length (in cm)'
p.yaxis.axis_label = 'Width (in cm)'

#hover your mouse over any point and check out more info

hover = HoverTool()
hover.tooltips=[
    ('Object', '@ObjectName'),
    ('Manufacturing city', '@ManuCity')
]

p.add_tools(hover)

show(p)
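
Since the collection has missing values, some sampled rows may have no length or width and so cannot appear as points. Here is a minimal sketch of one way to tidy the measurements before sampling. It assumes a made-up dataframe named toy in which the measurements are stored as text; I have not checked how the real columns are stored, so treat this as an illustration rather than a required step.

```python
import pandas as pd

# 'toy' is a hypothetical stand-in for ottking_df; the rows are invented
toy = pd.DataFrame({
    "ObjectName": ["Propeller", "Control unit", "Telephone model", "Casing part", "Cover"],
    "Length": ["12.5", "30", None, "n/a", "7"],
    "Width": ["4", None, "2", "1", "3.5"],
})

# turn text into numbers; anything that is not a number becomes NaN
for col in ["Length", "Width"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# keep only the rows that have both measurements
plottable = toy.dropna(subset=["Length", "Width"])

# never ask for more rows than exist, otherwise sample() raises an error
n = min(1000, len(plottable))
sample = plottable.sample(n, random_state=0)  # random_state makes the sample repeatable

print(sample)
```

The resulting sample can be passed to ColumnDataSource in the same way as in the block above.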

## Visualizing time-series data 

Our second objective is to create a time-series histogram. These are useful for when data must be analyzed over a period of time. Artifacts in this collection vary greatly in when they were manufactured. The origins of these artifacts span over two centuries. This graph will take inventory of how many artifacts were produced each year. 

Plotly is another library we can use to visualize data, similar to Bokeh. Let's install it. 

In [25]:
!pip install plotly 



The first line of code converts the BeginDate column to a date-time format. 
The second line counts the artifacts for each BeginDate; the histogram does not use this count directly, but it is a handy table to inspect. 
The third line tells Plotly to use BeginDate as the x-axis of the histogram.

In [24]:
import pandas as pd
import plotly.express as px 

#this converts a variable to a date-time format 
#format can be altered to d/m/y or any variation of that. 
#errors='coerce' turns values that cannot be parsed into NaT (missing) instead of raising an error

df['BeginDate'] = pd.to_datetime(df['BeginDate'], format='%Y', errors='coerce')

#this counts the artifacts for each BeginDate; print(grouped) to inspect it
grouped = df.groupby('BeginDate')['artifactNumber'].count()

#right now every year is represented in the histogram 
#use the nbins argument to broaden the measurements
fig = px.histogram(df, x="BeginDate") # adding nbins=40 groups production roughly by decade 
fig.show()
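
To make the date conversion and counting less mysterious, here is a minimal sketch on a few made-up values. The name toy and its rows are assumptions for illustration, and the real BeginDate column may be stored differently. It shows how errors='coerce' handles entries that are not a year, and how the remaining years can be counted by decade with pandas alone.

```python
import pandas as pd

# 'toy' is a hypothetical stand-in for df; the values are invented
toy = pd.DataFrame({
    "artifactNumber": ["a1", "a2", "a3", "a4", "a5"],
    "BeginDate": ["1901", "1905", "1913", "unknown", None],
})

# entries that are not a four-digit year become NaT instead of stopping the notebook
years = pd.to_datetime(toy["BeginDate"], format="%Y", errors="coerce")

# how many entries could not be read as a year
n_missing = int(years.isna().sum())
print(n_missing)

# round each remaining year down to its decade, then count artifacts per decade
decades = (years.dropna().dt.year // 10) * 10
per_decade = decades.value_counts().sort_index()
print(per_decade)
```

Checking n_missing on the real data is a quick way to learn how many artifacts are silently left out of the histogram.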

## Conclusion 

To conclude, we have loaded a collection of data into our notebook and listed its columns. We have also successfully created interactive visualizations of data. In the first graph, numerical data was presented in a scatter plot. In the second graph, time-series data was presented in a histogram. Both graphs are interactive: the Bokeh scatter plot uses the HoverTool, and the Plotly histogram shows hover labels by default, so the user can browse over specific points or bars and extract deeper meanings from the graphs.

I highly encourage users to experiment with the HoverTool to create interactive experiences within their own visualizations. This is one of many examples of how people can explore and engage with history instead of passively observing it. I also encourage users to practise narrowing dataframes by specifying a few items within columns. This is a valuable way to focus your research. 

While I have demonstrated scatter-plot and time-series graphs, I have not demonstrated how to make a bar graph with qualitative values on the x-axis. This is another simple visualization that would work with most datasets.

The visualizations themselves are quite simple and lend themselves to use in other datasets. Find a dataset of your own and experiment with these simple code blocks by substituting the variables. What visualizations can you make? 


## Further Reading

[Histograms in Python](https://plotly.com/python/histograms/) - this site is a guide to using the Plotly library to create histograms. It covers the basics of building a visualization using Plotly and also demonstrates how to do more complex visualizations like cumulative histograms and stacked histograms. 

[Interactive Data Visualization in Python With Bokeh](https://realpython.com/python-data-visualization-bokeh/) - this site walks the reader through every step of creating a data visualization. This guide doesn't take any part of the process for granted, doing well to explain every line of code that a user might be confused by. I found the section on making visualizations interactive to be especially useful. 

[15 Stunning Data Visualizations (And What You Can Learn From Them)](https://visme.co/blog/examples-data-visualizations/) - this is a fun blog about data visualization that I would recommend taking a few minutes to browse. It shows some exciting examples of how data can be visualized and what benefits it can offer. The potential in this discipline is limitless and these examples inspired me to consider how else visualization could be used. 


## References

[Visualizing Data with Bokeh and Pandas](https://programminghistorian.org/en/lessons/visualizing-with-bokeh) - my notebook is based on this guide by Charlie Harper. 

[Histograms in Python](https://plotly.com/python/histograms/) - I used this guide to make my time-series graph. Bokeh does not offer a ready-made histogram function, so I was directed towards Plotly instead. 

I also must acknowledge the help of my class Discord group in helping me find solutions to various issues. 