# Critique of a Data Product (Annual Report)

> Communicating about data is a special art...
- toc: true
- branch: master
- badges: false
- comments: true
- hide: false
- search_exclude: true

## Purpose
This project critiques the format and contents of a non-profit's annual report, which covers the operations of an initiative known as MissionTran. The client needs the current structure and contents of the report evaluated against best-practice data communication standards, along with recommendations for improving its data communication. A copy of the annual report may be found [here](/portfolio/AnnualReport2018.pdf).

## Description of the data product

The MissionTran project uses a team of volunteers to proofread sermons that have been machine-translated into a number of languages. Users proofread and correct machine-translated sentences one by one. Each proofreading effort on a sentence is called a *contribution* and comes in the form of an *edit*.

There are two kinds of contributions: *Votes* and *Creates*. A vote contribution happens when the user decides to simply vote for the provided translation, i.e. the user deems the translation to be correct. A create contribution happens when the user edits the translation to correct it.

There are a number of *bases* upon which a contribution is made. For *vote* contributions, the bases are:
- By *accepting* the machine-translated sentence as is (‘a’)
- By *creating* a new edit (‘c’)
- By *topping*, i.e. by voting for the current edit with the most votes (‘t’)
- By *picking* another edit to vote for (‘p’)

For create contributions the bases are:
- By *clearing* (‘k’)
- By *modifying* (‘m’)

A number of metrics are collected by the system:
- time used to make a vote contribution 
- time used to make a create contribution 
- vote time spent to proofread a complete translation (assignment)
- create time spent to proofread a complete translation (assignment)

The components of the system may be summarized as follows:
- a Translation has many
    - Sentences has many
        - Edits has many
            - Contributions
- a Translation has many
    - Assignments has many
        - Contributions
- A User has many
    - Assignments
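
The hierarchy above can be sketched as a small data model. This is only an illustration of the relationships, not the system's actual schema: every class and field name below is an assumption, as is the idea that an assignment's vote or create time is the sum of the times of its contributions of that kind.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contribution:
    kind: str       # 'vote' or 'create'
    basis: str      # one of 'a', 'c', 't', 'p', 'k', 'm'
    seconds: float  # time used to make the contribution (assumed field)


@dataclass
class Edit:
    text: str
    contributions: List[Contribution] = field(default_factory=list)


@dataclass
class Sentence:
    machine_translation: str
    edits: List[Edit] = field(default_factory=list)


@dataclass
class Assignment:
    username: str
    contributions: List[Contribution] = field(default_factory=list)

    def total_seconds(self, kind: str) -> float:
        """Sum the time of all contributions of one kind (assumed definition)."""
        return sum(c.seconds for c in self.contributions if c.kind == kind)


@dataclass
class Translation:
    language: str
    sentences: List[Sentence] = field(default_factory=list)
    assignments: List[Assignment] = field(default_factory=list)


@dataclass
class User:
    username: str
    assignments: List[Assignment] = field(default_factory=list)


# Hypothetical example: one user accepts a sentence and modifies another.
vote = Contribution(kind="vote", basis="a", seconds=4.0)
create = Contribution(kind="create", basis="m", seconds=11.5)
assignment = Assignment(username="user1", contributions=[vote, create])
```

The same `Contribution` object can be referenced from both an `Edit` and an `Assignment`, which mirrors the two paths to contributions in the list above.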

## Finding Your Purpose and Message

### Actionable data

Locating *actionable data* in the annual report is not straightforward. Most charts seem to simply dump data. Only some charts seem to suggest their use as a trigger for action. One such chart is the report's Figure 12, which shows the number of contributions by username by contribution kind:

![](../images/dpc_fig12.png "Figure 12:  Number of contributions by username by contribution kind")

This chart shows how productive users were by showing the total number of contributions each made. This number is further broken down into *vote* contributions and *create* contributions.

### Balance

There should be a balance between the representations for the *data*, the *author*, and the *audience*. In the report, the balance is tilted strongly in the direction of the data. There is a strong impression that all available data has simply been dropped into the document. The consumer is then expected to pick out pieces for consumption. 

The document does not reveal much about the objectives of the *author*, except maybe to communicate the progress of the translation team. 

There is an implication that the *audience’s* need is simply to know the progress of the project.

### Audience factors

The *role* of the audience seems to be stakeholders interested in the status of the project. There is hardly any structure that would make it easy and direct to answer high-priority questions. 

The *workflow* of the audience might be catered for as the information is presented in the form of charts that can be accessed easily in an office setting. 

There is no elaboration on the *data comfort and skills* of the audience. The author does point out that “The workings of the translation system will not be presented here as most stakeholders are well aware of the details.” This suggests that the audience possesses the *industry and data expertise* and does not need embedded explanations.

## Information Discrimination

There is no identifiable core problem or theme. The report is simply a series of charts that try to convey the performance of the team. *Reporting* is not well separated from *exploration*. Some charts, like Figure 10, are totally unnecessary. This chart should at least have been pushed to an appendix, but there is no appendix. Here is Figure 10:

![](../images/dpc_fig10.png "Figure 10:  Joint distribution of assignment total vote/create time")

Figure 10 seems much too technical and does not belong with the remaining charts.

There is a section (at the end) on “Day of Week.” The chart in this section (Figure 13) breaks down contribution effort (in seconds) by day-of-week by contribution kind:

![](../images/dpc_fig13.png "Figure 13:  Distribution of effort by day-of-week by contribution kind")

It seems unnatural to do this kind of breakdown. Why would the distribution of effort per contribution vary with day of week? I cannot think of a fundamental reason why the time a user needs to proofread a specific sentence would vary by day-of-week. This is not a meaningful chart. It should be thrown out.

With a stretch of imagination one could think of the following scenario. Let us say that on Mondays there is always a series of celebrations in the large conference room next door. This might distract the proofreaders. This kind of scenario, however, should be investigated in the 'backroom'. It should not be part of a regular annual report.

If the author insists on keeping this chart, it should be changed considerably. Its current form hardly shows the required information. The presence of a few outliers compresses the real data so much that nothing is revealed. Here is a first suggestion. The weekdays are ordered, the axis labels are improved, and a measure of transparency is given to the markers. This helps to show how much the points overlap:

![](../images/dpc_fig13_replacement1.png "Figure 13-1:  Replacement suggestion 1")

Even better, we can provide a small amount of sideways random jitter to help reveal the overlap:

![](../images/dpc_fig13_replacement2.png "Figure 13-2:  Replacement suggestion 2")

If we remove the outliers, the chart becomes more effective:

![](../images/dpc_fig13_replacement3.png "Figure 13-3:  Replacement suggestion 3")

We may also use a boxplot which is commonly used in a case like this:

![](../images/dpc_fig13_replacement4.png "Figure 13-4:  Replacement suggestion 4")

A violin plot will also be effective:

![](../images/dpc_fig13_replacement5.png "Figure 13-5:  Replacement suggestion 5")

The latest plots all reveal that there is no real variation in contribution effort per sentence over day-of-week. This was certainly not evident from the original chart.
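
The jitter and outlier steps behind the replacement charts can be sketched with the standard library alone. The report does not say how outliers were identified, so the 1.5 × IQR rule below is an assumption. The function names, the jitter width, and the sample observations are also made up for illustration.

```python
import random
import statistics

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def jitter_positions(days, width=0.2, seed=0):
    """Place each day at its ordered index plus a small sideways offset."""
    rng = random.Random(seed)
    return [WEEKDAYS.index(d) + rng.uniform(-width, width) for d in days]


def outlier_bounds(values, k=1.5):
    """Return (low, high) limits at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Illustrative (made-up) observations: (day, seconds per contribution).
observations = [("Mon", 3), ("Mon", 4), ("Tue", 5), ("Wed", 6),
                ("Thu", 7), ("Fri", 8), ("Fri", 300)]

low, high = outlier_bounds([sec for _, sec in observations])
kept = [(day, sec) for day, sec in observations if low <= sec <= high]
x_positions = jitter_positions([day for day, _ in kept])
```

The jittered `x_positions` and the retained seconds can then be handed to any plotting library, with marker transparency set through its own option (for example the `alpha` argument of matplotlib's `scatter`).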

## Defining Meaningful and Actionable Metrics

Metrics are the strong point of the document. The metrics are:
- not too *simplistic*
- not overly *complex*
- not too *many*
- free of *vanity* metrics

The metrics have a *common interpretation*, e.g.
- time used to make a vote contribution
- time used to make a create contribution
- vote time spent to proofread a complete translation (assignment)
- create time spent to proofread a complete translation (assignment)
- total vote time spent during an assignment

Metrics are mostly *actionable*; e.g. how much time is spent per sentence.

Metrics are *accessible* and also *transparent* with simple calculations.

## Creating Structure and Flow to your Data Products

There is a simplistic logical structure and *no narrative* at all – very poor!

The author could have chosen one of many narrative flows, e.g. to
- show how machine-translation-assisted proofreading is more efficient than humans translating from scratch
- show how the productivity of users varies by language
- show the work patterns of users, i.e. a steady amount each workday, or only on one day per week, etc.

However, there is no meaningful flow. There is no notion of “guided safari” storytelling, or even of a traditional story. This is a glaring weakness in the data product.

## Designing Attractive, Easy-to-understand Data Products

The presentation looks rough and unfinished (low aesthetic value). Given the author’s comment about the stakeholders’ understanding of the workings of the system, this might be overlooked somewhat. 

There is no need for *connectivity*, *data detail*, *interactivity*, and *mobility* as this is an annual report. 

The use of *color* in charts was mostly acceptable.

The author often did not choose the most appropriate chart type, e.g. in Figures 1 and 2 a line chart should have been used instead of the heat maps:

![](../images/dpc_fig1.png "Figure 1:  Contributions")

If the author insists on having a heat map, it could have been made simpler and more effective:

![](../images/dpc_fig1_replacement.png "Figure 1-1:  Replacement suggestion")

The chart type for Figure 6 was chosen poorly, and the presence of a few outliers hides most of the real data:

![](../images/dpc_fig6.png "Figure 6:  Distribution of effort by contribution kind by sentence type")

The following chart would have been more effective (after removing the outliers):

![](../images/dpc_fig6_replacement.png "Figure 6-1:  Replacement suggestion")

## Creating Dialogue with Your Data Products

In general, there is not a lot of *chart junk*. The author mostly used sufficient *contrast*. *Readability of labels* is generally poor. A serious omission is that no chart has a heading! For example, Figure 11:

![](../images/dpc_fig11.png "Figure 11:  Distribution of effort")

Time is always expressed in seconds, which is often inappropriate. Although the meaning of variables is explained, it is bothersome that they appear in their “software” form on axes, as in Figure 8:

![](../images/dpc_fig8.png "Figure 8:  Assignment total create time versus translation sentence count")
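
Both labeling issues are cheap to fix before plotting. The sketch below is one possible approach: the thresholds for switching units are a judgment call, and the variable name in `AXIS_LABELS` is hypothetical, since the report's actual internal names are not reproduced here.

```python
def format_duration(seconds):
    """Render a duration in the largest unit that keeps the number readable."""
    if seconds < 60:
        return f"{seconds:.0f} s"
    if seconds < 3600:
        return f"{seconds / 60:.1f} min"
    return f"{seconds / 3600:.1f} h"


# Hypothetical mapping from a "software" variable name to a reader-facing label.
AXIS_LABELS = {
    "assignment_total_create_time": "Total create time per assignment",
}


def axis_label(variable):
    """Fall back to the raw name when no friendly label is defined."""
    return AXIS_LABELS.get(variable, variable)
```

For example, `format_duration(5400)` yields `"1.5 h"`, which is easier to read on an axis than `5400`.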

Sentence types are not defined anywhere.

*Sorting for comprehension* could have been done in Figures 5 and 12:

![](../images/dpc_fig5.png "Figure 5:  Distribution of effort by language by sentence type")

![](../images/dpc_fig12.png "Figure 12:  Number of contributions by username by contribution kind")

In addition, monotone *color variants* could have been used in Figure 12.
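
Sorting for comprehension in a chart like Figure 12 amounts to ordering the categories by the quantity the reader cares about, here total contributions, before plotting. A minimal sketch with made-up usernames and counts:

```python
def sort_by_total(counts):
    """Order usernames by total contributions, largest first."""
    return sorted(counts, key=lambda user: sum(counts[user].values()), reverse=True)


# Illustrative (made-up) contribution counts per user and contribution kind.
counts = {
    "user_a": {"vote": 120, "create": 30},
    "user_b": {"vote": 400, "create": 95},
    "user_c": {"vote": 10, "create": 2},
}

ordered_users = sort_by_total(counts)
```

Plotting the bars in the order of `ordered_users` lets the reader see the most productive users at a glance instead of hunting through an arbitrary ordering.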

## Summary

The annual report showcases a lot of data. The disappointing part is that it has been done very ineffectively and unprofessionally.

The main weaknesses are:
- No purpose or message
- Use of data indiscriminately
- Weak structure and no narrative flow
- No mentionable design
- Will not trigger conversation and dialogue in a straightforward way

The one strength:
- Meaningful metrics

# 6 CONCLUSIONS & RECOMMENDATIONS

We have demonstrated how flight profile time-series can be turned into images for more effective classification. We then identified the best technique to transform a profile time-series into an image for use by the classification process. Using transfer learning, we showed how quickly a deep learning model could be trained. We conclude that the need for hand classification of flight profiles has been greatly reduced, leading to significant potential time savings for post-flight analysts. Because this work uses a rich, publicly available flight dataset, we hope it will encourage ordinary data scientists (who do not have access to company flight data) as well as post-flight analysts (who do have access) to undertake studies in this important area.

We recommend that interested analysts take this work as a starting point and adapt it to suit their needs. This may even involve changing the technology stack, for example, making use of other deep learning libraries and a different programming language. Here we used the fastai Python library (built on top of PyTorch) and the Python language. There are a number of other useful technology environments, e.g. Java, TensorFlow, Julia, and MATLAB.

We suggest that analysts need not shy away from using deep learning for post-flight analysis. Transfer learning makes the training of deep learning models very tractable. In our case, transfer learning was based on a model pretrained on ImageNet, a dataset of over 14 million images in more than 20,000 categories. Many cloud providers offer GPUs (Graphics Processing Units), which are ideal for the training process. GPUs are not necessary for inference. Even without access to a GPU, the training process is still tractable on an ordinary laptop. This is the beauty of transfer learning.

# 7 SUMMARY

In this work, we:
* Performed a comparison of a number of transformation techniques in terms of their associated image classification performance. We applied each transformation technique to the cleaned time-series dataset in turn, trained a CNN to do classification (using supervised learning), and recorded the performance. Then we selected the most performant transformation technique and used it in the rest of the analysis pipeline. The following transformation techniques were considered:
    * Altitude line plots transformed into an image
    * Altitude area plots transformed into an image
    * Gramian Angular Summation Field (GASF)
    * Gramian Angular Difference Field (GADF)
    * Markov Transition Field (MTF)
    * Recurrence Plot (RP)
* Trained a model to classify flight profiles into developed (useful) and non-developed (non-useful) profiles. We also considered the use of anomaly detection by means of an autoencoder (instead of a classification algorithm) due to the significant class imbalance and concluded that the autoencoder did not work as well as the classifier.
* Trained a model to do multi-label classification of developed profiles. The labels reflected whether a profile had a canonical climb/cruise/descent segment.
* Trained a model to classify flight profiles with canonical cruise segments (regardless of the properties of climb or descent segments) into profiles that have extended cruises (useful) and shorter cruises (non-useful).
* Prepared a significant test dataset, consisting of datapoints that have never been seen by any of the models and have not been labeled. We constructed an end-to-end analytic inference process to simulate a production system and applied it to the test dataset. Finally, we made recommendations to post-flight and other interested analysts.
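
To make the Gramian Angular Field transformations in the list above concrete, here is a minimal standard-library sketch following the usual definition (rescale the series to [-1, 1], take the arccosine, then form pairwise cosines of sums or sines of differences). It is not the code used in this work: the function names are illustrative, and any down-sampling of the series before encoding is left out.

```python
import math


def rescale(series):
    """Min-max rescale a series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(2 * x - hi - lo) / (hi - lo) for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) as nested lists for a 1-D series."""
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    scaled = [max(-1.0, min(1.0, x)) for x in rescale(series)]
    phi = [math.acos(x) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


# Tiny made-up altitude series: a steady climb.
gasf, gadf = gramian_angular_fields([0, 5000, 10000])
```

Each matrix has one row and one column per time step, so a series of length *n* becomes an *n* x *n* image that a CNN can consume.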

# 8 FURTHER EXPERIMENTATION

There are a good number of hyper-parameters that may be adjusted leading to further experiments, for example:
* Fraction of train data dedicated for validation (20% here)
* The batch size during training (32 for mod3a and 8 for mod2 and mod1)
* Images were resized to 128 x 128. A technique that holds promise is to first down-sample images drastically, say to 32 x 32. Then, after training, transfer learning is used while progressively up-sampling again.
* Learning rates. This is arguably the most influential hyper-parameter during training. It may be worthwhile to adjust the learning rates used (both the frozen learning rate, *lrf*, and the unfrozen learning rate, *lru*).
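
For experimenters who want to vary these settings systematically, the values stated above can be collected in one place and the validation split made reproducible. The sketch below is an assumption about how one might organize this, not the notebooks' actual code. The dictionary layout, the function name, and the seed are illustrative, and the learning rates are left out because their values are not given here.

```python
import random

# Values taken from the list above; the dictionary layout itself is illustrative.
HYPERPARAMETERS = {
    "mod3a": {"valid_fraction": 0.2, "batch_size": 32, "image_size": 128},
    "mod2": {"valid_fraction": 0.2, "batch_size": 8, "image_size": 128},
    "mod1": {"valid_fraction": 0.2, "batch_size": 8, "image_size": 128},
}


def train_valid_split(items, valid_fraction=0.2, seed=42):
    """Shuffle a copy of `items` and hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical file names standing in for the training images.
files = [f"profile_{i}.png" for i in range(100)]
train_files, valid_files = train_valid_split(
    files, HYPERPARAMETERS["mod3a"]["valid_fraction"]
)
```

Fixing the seed keeps the validation set identical across runs, so that a change in results can be attributed to the hyper-parameter being varied rather than to a different split.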

We used data from a single airplane in this study. It may be worthwhile to analyze the available datasets and assemble one that represents a number of different aircraft. Care should be taken, however, because the identity and model of each aircraft are deliberately omitted. The analyst might therefore end up mixing data from a 747 with that of a Learjet, or even a Cessna, which is probably not all that useful.

No *data augmentation* was used during training. It should be possible to generate more data by flipping images horizontally for mod3a and mod1. This will not work for mod2 due to the implied change of the segment types. A small amount of zooming might also be tried.
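
A toy example shows why horizontal flipping is safe for some models but not for mod2. The image below is a made-up 3 x 3 grid (rows are pixel rows, top row first) standing in for an altitude plot. The function name is illustrative.

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


# Made-up altitude plot of a climb: the trace rises from bottom-left to top-right.
climb = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

flipped = flip_horizontal(climb)
# The trace now falls from top-left to bottom-right, i.e. it looks like a descent.
```

The flipped image is still a plausible flight profile, but a climb has turned into a descent, so any label that names the segment type would no longer match the image.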

The CNN architecture used throughout was ResNet-50. We believe ResNet is the current state of the art, but there are other promising architectures, e.g. the Inception network. If an experimenter is constrained in terms of compute resources, ResNet-50 can be scaled down, e.g. to ResNet-34 or ResNet-18.

In the case of the autoencoder, a *linear* autoencoder was used. Better results might be obtained if a *convolutional* autoencoder is used instead. The code for a convolutional autoencoder is included in the anomaly detection notebook.

Finally, we need to mention that, although this paper focused on only using the altitude time-series of a flight, there are many more variables to explore. As mentioned, our data source makes 186 variables available. A few simple explorations are undertaken in the Python notebook:

[20_eda1.ipynb](https://nbviewer.jupyter.org/github/kobus78/dashlink/blob/master/20_eda1.ipynb)

Something interesting that might be tried is to combine a number of normalized variables (to allow for a single vertical scale) on a single line plot, each in a different color. Deep learning may then be used to train for the identification of normal versus anomalous situations.
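
The normalization step needed to share a single vertical scale can be sketched with min-max scaling. This is one of several possible choices. The variable names and values below are made up and are not taken from the data source.

```python
def min_max_normalize(series):
    """Scale a series to the interval [0, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(x - lo) / (hi - lo) for x in series]


# Hypothetical variables with very different native ranges.
variables = {
    "altitude_ft": [0, 5000, 10000, 5000, 0],
    "airspeed_kt": [100, 250, 300, 250, 100],
}

normalized = {name: min_max_normalize(values) for name, values in variables.items()}
```

After scaling, every variable lies between 0 and 1 and can be drawn on the same axes, each in its own color, before the plot is saved as an image for training.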

# APPENDICES

## Appendix A: Filesystem Layout

* dashlink
    * Train
        * 1min
            * .csv files
        * png3
            * non
            * typ
        * png3a
            * non
            * typ
            * export.pkl file
        * png3b
            * non
            * typ
            * export.pkl file
        * png3c
            * non
            * typ
            * export.pkl file
        * png3d
            * non
            * typ
            * export.pkl file
        * png3e
            * non
            * typ
            * export.pkl file
        * png3f
            * non
            * typ
            * export.pkl file
        * png2
            * .png files
            * canonical-segments.csv file
            * export.pkl file
        * png1
            * non
            * typ
            * export.pkl file
    * Test
        * png3
            * non
            * typ
        * src3
            * .png files
        * png2
            * non
            * typ
        * png1
            * non
            * typ
    * 10_mat2csv.ipynb
    * 10_mat2csv-2.ipynb
    * 10_csv2png-3.ipynb
    * 20_eda1.ipynb
    * 30_mod3a.ipynb
    * 30_mod3b.ipynb
    * 30_mod3c.ipynb
    * 30_mod3d.ipynb
    * 30_mod3e.ipynb
    * 30_mod3f.ipynb
    * 30_mod4-2.ipynb
    * 30_mod2.ipynb
    * 30_mod1.ipynb
    * 40_inf3.ipynb
    * 40_inf2.ipynb
    * 40_inf1.ipynb

NOTE: The Python notebook files (.ipynb) have a prefix that indicates the following:
* 10_ for data preparation
* 20_ for exploratory data analysis
* 30_ for modeling and training
* 40_ for inference and testing
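
The prefix convention can also be applied programmatically, for instance to group notebooks by pipeline stage. The helper below is a hypothetical convenience and not part of the repository.

```python
# Stage names taken from the note above.
STAGES = {
    "10": "data preparation",
    "20": "exploratory data analysis",
    "30": "modeling and training",
    "40": "inference and testing",
}


def notebook_stage(filename):
    """Return the pipeline stage implied by a notebook's numeric prefix."""
    prefix = filename.split("_", 1)[0]
    return STAGES.get(prefix, "unknown")
```

For example, `notebook_stage("30_mod3a.ipynb")` returns `"modeling and training"`.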