# Clustering-based Analysis of Handwritten Digits

This notebook is part of the DHNB 2022 tutorial "Introduction to Text and Image Analysis Using Python" (website: [https://raphaelaheil.github.io/2022-03-15-dhnb/](https://raphaelaheil.github.io/2022-03-15-dhnb/)). 

## Session Objectives

- demonstrate a general image processing pipeline
- provide an example of how clustering can be used to annotate data

## Dataset

This notebook uses the following dataset:
> "DIDA is a new image-based historical handwritten digit dataset and collected from the Swedish historical handwritten document images between the year 1800 and 1940"
>
> from [https://didadataset.github.io/DIDA/](https://didadataset.github.io/DIDA/)

> Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, Niklas Lavesson, "DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset", Big Data Research, Volume 23, 2021, 100182, ISSN 2214-5796, [https://doi.org/10.1016/j.bdr.2020.100182](https://doi.org/10.1016/j.bdr.2020.100182) ([https://www.sciencedirect.com/science/article/pii/S2214579620300502](https://www.sciencedirect.com/science/article/pii/S2214579620300502))

Original dataset download: [https://didadataset.github.io/DIDA/](https://didadataset.github.io/DIDA/)

Workshop dataset download: [https://github.com/RaphaelaHeil/clustering-dhnb/releases/download/v1.0/digits.zip](https://github.com/RaphaelaHeil/clustering-dhnb/releases/download/v1.0/digits.zip)

_For usability, the original dataset has been restructured for this workshop and compressed as a `.zip` archive instead of `.rar`. The images themselves remain unchanged._

## General Pipeline

1. Load images
2. Preprocess
    1. resize images to fixed dimensions (64x64px)
    2. turn colour (RGB) images into greyscale
    3. turn greyscale images into black and white ("[Otsu's Method](https://en.wikipedia.org/wiki/Otsu%27s_method)")
3. Extract features ("[Histogram of Oriented Gradients](https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients)")
4. Cluster features ("[k-Means](https://en.wikipedia.org/wiki/K-means_clustering)")

![Visualisation of intermediate results of the processing pipeline](pipeline.svg)

---

## Required Packages

```python
import matplotlib.pyplot as plt
import numpy as np

from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.filters import threshold_otsu
from skimage.transform import resize

from sklearn.cluster import KMeans
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

import utils
```

### Brief introduction to the main packages

**Matplotlib**
> "Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python."
>
> from [https://matplotlib.org/](https://matplotlib.org/)

Documentation: [https://matplotlib.org/stable/api/index](https://matplotlib.org/stable/api/index)

Extensive gallery of examples: [https://matplotlib.org/stable/gallery/index.html](https://matplotlib.org/stable/gallery/index.html)

**NumPy**
> "NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more." 
>
> from [https://numpy.org/](https://numpy.org/)

Documentation: [https://numpy.org/doc/stable/reference/index.html#reference](https://numpy.org/doc/stable/reference/index.html#reference)

**Skimage (scikit-image)**
> "scikit-image is a collection of algorithms for image processing."
>
> from [https://scikit-image.org/](https://scikit-image.org/)

Documentation: [https://scikit-image.org/docs/stable/api/api.html](https://scikit-image.org/docs/stable/api/api.html)

Gallery of examples: [https://scikit-image.org/docs/stable/auto_examples/index.html](https://scikit-image.org/docs/stable/auto_examples/index.html)

**Sklearn (scikit-learn)**
> "Simple and efficient tools for predictive data analysis."
> 
> from [https://scikit-learn.org/](https://scikit-learn.org/)

Documentation: [https://scikit-learn.org/stable/modules/classes.html](https://scikit-learn.org/stable/modules/classes.html)

User guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

**utils**

A utility package written specifically for this tutorial. It contains helper functions for loading, processing and visualising the data.

## 1 Loading the data
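
The tutorial's own loading helpers live in `utils` and are not reproduced here. As a rough, independent sketch, images in a folder could be read with `skimage.io.imread`. The function name `load_images`, the `folder` argument and the `*.png` pattern are assumptions made for illustration; the actual layout and file type of the workshop dataset may differ.

```python
from pathlib import Path

from skimage.io import imread


def load_images(folder, pattern="*.png"):
    """Hypothetical helper: read every image in `folder` that matches `pattern`."""
    # Sort the paths so that the order of the images is reproducible.
    paths = sorted(Path(folder).glob(pattern))
    # The images may differ in size, so they are kept in a list, not one array.
    images = [imread(str(path)) for path in paths]
    return paths, images
```

Calling `load_images("some/folder")` would return the file paths together with one NumPy array per image, which keeps each image traceable to its source file.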

## 2 Pre-processing

1. resize images to fixed dimensions (64x64px)
2. turn colour (RGB) images into greyscale
3. turn greyscale images into black and white ("[Otsu's Method](https://en.wikipedia.org/wiki/Otsu%27s_method)")

### 2.1 Resize
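
Clustering needs feature vectors of equal length, so every image is first brought to the same size. A minimal sketch using `skimage.transform.resize` on a synthetic stand-in image (the random array and the variable names are assumptions, not workshop data):

```python
import numpy as np
from skimage.transform import resize

rng = np.random.default_rng(0)
# Synthetic stand-in for an RGB digit crop of arbitrary size (rows, cols, channels).
image = rng.random((90, 50, 3))

# Only rows and columns are given; the number of channels is preserved.
resized = resize(image, (64, 64), anti_aliasing=True)

print(image.shape, "->", resized.shape)
```

Note that `resize` does not keep the aspect ratio: a tall, narrow digit is stretched to fill the 64x64 square.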

### 2.2 Greyscale Conversion
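
`skimage.color.rgb2gray` collapses the three colour channels into one luminance channel by taking a weighted sum. The sketch below applies it to a synthetic RGB array; the input data and variable names are placeholders.

```python
import numpy as np
from skimage.color import rgb2gray

rng = np.random.default_rng(0)
# Synthetic stand-in for a resized RGB image with values in [0, 1).
rgb = rng.random((64, 64, 3))

# The channel axis is removed; the result is a 2D float array.
grey = rgb2gray(rgb)

print(rgb.shape, "->", grey.shape)
```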

### 2.3 Conversion to Black and White
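
Otsu's method picks a single global threshold that best separates dark from light pixels. A small sketch with `skimage.filters.threshold_otsu` on a synthetic image (light "paper" with one dark "stroke"); the test image and the choice of which side counts as `True` are assumptions.

```python
import numpy as np
from skimage.filters import threshold_otsu

rng = np.random.default_rng(0)
# Light background with a dark vertical stroke, both with a little noise.
grey = rng.uniform(0.8, 1.0, size=(64, 64))
grey[16:48, 28:36] = rng.uniform(0.0, 0.2, size=(32, 8))

threshold = threshold_otsu(grey)
# True marks pixels lighter than the threshold; use ~binary to mark the ink instead.
binary = grey > threshold

print("threshold:", round(float(threshold), 3))
```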

## 3 Feature Extraction: Histogram of Oriented Gradients
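
A Histogram of Oriented Gradients describes an image by the distribution of edge directions in small cells. The following is a sketch using `skimage.feature.hog`; the parameter values (9 orientations, 8x8 pixel cells, 2x2 cell blocks) are assumptions chosen for illustration and not necessarily the ones used in the workshop.

```python
import numpy as np
from skimage.feature import hog

# Synthetic 64x64 image with one bright vertical bar.
image = np.zeros((64, 64))
image[16:48, 28:36] = 1.0

features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
)

# 8x8 cells give 7x7 overlapping blocks, each holding 2*2*9 values.
print(features.shape)
```

With these settings every 64x64 image yields a vector of 7 * 7 * 2 * 2 * 9 = 1764 values, so all images end up with features of the same length.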

## 4 Clustering: k-Means Algorithm
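
k-Means groups feature vectors around `k` centres by repeatedly assigning each vector to its nearest centre and then moving the centres. A sketch with `sklearn.cluster.KMeans` on synthetic 2D points that stand in for real feature vectors; the data, `n_clusters=3` and the variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated groups of 20 points each, as a stand-in for image features.
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centres])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
# One cluster index per row of `features`.
cluster_ids = kmeans.fit_predict(features)

print(np.bincount(cluster_ids))
```

The cluster indices are arbitrary: cluster `0` has no connection to the digit 0 until someone inspects its members.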

## 5 Bulk Annotation
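
Once the images are clustered, a person can look at a few members of each cluster and give the whole cluster one label, which annotates many images at once. A minimal sketch of that mapping step; the cluster indices and the `cluster_to_digit` dictionary are made-up examples.

```python
import numpy as np

# Cluster index assigned to each image (made-up values).
cluster_ids = np.array([0, 2, 1, 1, 0, 2, 2])

# Hypothetical labels chosen by a person after inspecting each cluster.
cluster_to_digit = {0: 7, 1: 3, 2: 0}

# Every image inherits the label of its cluster.
annotations = np.array([cluster_to_digit[c] for c in cluster_ids.tolist()])

print(annotations)
```

Any image that landed in the wrong cluster receives the wrong label, so the quality of the annotation depends directly on the quality of the clustering.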

## 6 Ground Truth Comparison
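
If true labels are available, the cluster-derived annotations can be checked against them. The sketch below uses `confusion_matrix` and `classification_report` from `sklearn.metrics` on small made-up label arrays; the values are illustrative only and are not results from the dataset.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and cluster-derived annotations for eight images.
ground_truth = np.array([0, 0, 0, 3, 3, 7, 7, 7])
annotations = np.array([0, 0, 3, 3, 3, 7, 7, 0])
labels = [0, 3, 7]

# Rows are true labels, columns are assigned labels.
matrix = confusion_matrix(ground_truth, annotations, labels=labels)
report = classification_report(ground_truth, annotations, labels=labels, zero_division=0)

print(matrix)
print(report)
```

In a notebook, `ConfusionMatrixDisplay(matrix, display_labels=labels).plot()` draws the same matrix as a heat map.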