# A Zipf classifier
## Using Zipf and statistics to map book fragments to their authors and period

## Introduction
Hello, I'm Luca Cappelletti, and here I will show you an example of how to use [ZipfClassifier](https://github.com/LucaCappelletti94/zipf_classifier), a classifier that leverages the assumption that some kinds of datasets (texts, [some images](http://www.dcs.warwick.ac.uk/bmvc2007/proceedings/CD-ROM/papers/paper-288.pdf), even [sounds in spoken languages](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0033993)) follow the [Zipf distribution](https://en.wikipedia.org/wiki/Zipf%27s_law).

## How to use this notebook

This is a [Jupyter Notebook](http://jupyter.org/). You can either read it [here on GitHub](https://github.com/LucaCappelletti94/zipf_classifier/blob/master/Classifying%20authors.ipynb) or, preferably, run it on your own computer. Jupyter comes installed with [Anaconda](https://anaconda.org/); to start it, just run the following in your terminal:

`jupyter-notebook`

## What we will use

### The packages
We will use the [ZipfClassifier](https://github.com/LucaCappelletti94/zipf_classifier) and two other packages of mine: [Zipf](https://github.com/LucaCappelletti94/zipf) to create the distributions from the texts and [Dictances](https://github.com/LucaCappelletti94/dictances) for the classification metrics. If you need to install them, just run the following command in your terminal:

```pip install zipf dictances zipf_classifier```

In [1]:
from zipf.factories import ZipfFromDir
from zipf_classifier import ZipfClassifier
from dictances import jensen_shannon, normal_total_variation, kullback_leibler, bhattacharyya, bhattacharyya_coefficient, hellinger

### Additional packages
We will also be using some utilities, such as the progress bar `tqdm`, the `requests` package and `tabulate`. If you don't have them already, you can install them by running:

```
pip install tqdm requests tabulate
```

The other packages are either part of the Python standard library or, like `pygments` and `IPython`, come installed with Jupyter.

In [2]:
import io
import inspect
import json
import math
import os
import random
import shutil
import zipfile
from collections import defaultdict
from pprint import pprint

import requests
from tqdm.notebook import tqdm  # notebook-friendly progress bar

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from IPython.display import HTML, display
import tabulate

### The dataset
I've prepared a **dataset** of famous classic authors (Mary Shelley, Dumas, Carroll...), available in the repo and downloadable [here](https://github.com/LucaCappelletti94/zipf_classifier/blob/master/dataset.zip?raw=true): the authors are separated by period and their books are already split into chapters, with extension `.txt`.

#### Retrieving the dataset
We download and extract the dataset:

In [3]:
dataset_name = "authors"
zip_file_url = "https://github.com/LucaCappelletti94/zipf_classifier/blob/master/%s.zip?raw=true"%dataset_name

In [4]:
if not os.path.isdir(dataset_name):
    print("Downloading %s.zip"%dataset_name)
    r = requests.get(zip_file_url)
    r.raise_for_status()  # fail early if the download did not succeed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    print("Extracting %s.zip"%dataset_name)
    z.extractall()
    print("Done!")

##### Some small helpers
Let's define some small functions to help with loading folders:

In [5]:
def get_dirs(root):
    """Return subdirectories under a directory."""
    return [root+"/"+d for d in os.listdir(root) if os.path.isdir(root+"/"+d)]

and the book folders:

In [6]:
def get_books(root):
    """Return all books found under a given root."""
    return [book[0] for book in os.walk(root) for chapter in book[2][:1] if chapter.endswith('.txt')]

and the saved zipfs:

In [7]:
def get_zipfs(root):
    """Return all zipfs found under a given root."""
    return [zipfs[0]+"/"+zipf for zipfs in os.walk(root) for zipf in zipfs[2] if zipf.endswith('.json')]

#### Some stylers

In [8]:
def b(string):
    """Return a boldified string."""
    return "\033[1m%s\033[0;0m"%string

In [9]:
def red(string):
    """Return a red string."""
    return "\033[0;31m%s\033[0;0m"%string

In [10]:
def green(string):
    """Return a green string."""
    return "\033[0;32m%s\033[0;0m"%string

In [11]:
def yellow(string):
    """Return a yellow string."""
    return "\033[0;33m%s\033[0;0m"%string

In [12]:
def success(results, metric):
    """Show the result of a given test."""
    successes = results["success"]
    total = successes + results["failures"] + results["unclassified"]
    percentage = round(successes/total*100,2)
    if percentage > 90:
        metric_name = green(metric.__name__)
    elif percentage > 75:
        metric_name = yellow(metric.__name__)
    else:
        metric_name = red(metric.__name__)
    print("Success with metric %s: %s"%(metric_name,b(str(percentage)+"%")))
    display(HTML(tabulate.tabulate(list(results.items()), tablefmt='html')))

In [13]:
def print_function(function):
    """Print the source of a given function."""
    code = inspect.getsource(function)
    formatter = HtmlFormatter()
    display(HTML('<style type="text/css">{}</style>{}'.format(
        formatter.get_style_defs('.highlight'),
        highlight(code, PythonLexer(), formatter))))

#### Splitting into train and test
Let's say we use 60% of the books for training and the remaining 40% for testing. Let's proceed to split the dataset in two:

In [14]:
training_percentage = 0.6

First we check if the dataset is already split (this might be a re-run):

In [15]:
def is_already_split(root):
    """Return a bool indicating if the dataset has already been split."""
    split_warns = ["training", "testing"]
    for sub_dir in os.listdir(root):
        for split_warn in split_warns:
            if split_warn in sub_dir:
                return True
    return False

Then we split the dataset's books:

In [16]:
def split_books(root, percentage):
    """Split the dataset into training and testing."""
    authors = get_dirs(root)
    # Use the same number of books for every author to keep the classes balanced
    min_books = min((len(get_books(author)) for author in authors), default=0)
    n = int(min_books*percentage)
    for author in authors:
        books = get_books(author)
        random.seed(42) # for reproducibility
        random.shuffle(books) # Shuffling books
        books = books[:min_books]
        training_set, testing_set = books[:n], books[n:] # splitting books into the two partitions
        # Copying into the respective folders
        for book in training_set:
            shutil.copytree(book, "%s/training/%s"%(root, book[len(root)+1:]))
        for book in testing_set:
            shutil.copytree(book, "%s/testing/%s"%(root, book[len(root)+1:]))

Here we actually run the two functions:

In [17]:
if is_already_split(dataset_name):
    print("I believe I've already split the dataset!")
else:
    split_books(dataset_name, training_percentage)

I believe I've already split the dataset!


### The metrics

After experimental analysis I've determined that the [Jensen Shannon Divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) is one of the best metrics for this kind of classification, others being the [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance) and the [Bhattacharyya distance](https://en.wikipedia.org/wiki/Bhattacharyya_distance).

_A small test to experimentally verify this claim is run at the end of the presentation._

#### Reasons for using these distances
1. They are defined on distributions that do not necessarily share all events.
2. They can be implemented with computational complexity $O(\min(n,m))$, where $n$ and $m$ are the numbers of events in the two distributions $X$ and $Y$, provided that both are normalized: $\sum_{i=1}^{n} x_i = 1$ and $\sum_{i=1}^{m} y_i = 1$.
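
As an illustration of the second point, here is a minimal sketch (not the `dictances` implementation) of the Hellinger distance between two distributions stored as dictionaries. It assumes that both distributions sum to 1, in which case the distance reduces to $\sqrt{1 - \sum_i \sqrt{x_i y_i}}$ and only the events present in both distributions contribute, so iterating over the smaller dictionary is enough. The function names and the toy distributions are made up for this example.

```python
import math


def hellinger_sketch(x, y):
    """Hellinger distance between two dict distributions that each sum to 1."""
    # Iterate over the smaller dictionary: only shared events contribute.
    small, big = (x, y) if len(x) <= len(y) else (y, x)
    coefficient = sum(math.sqrt(p * big[event])
                      for event, p in small.items() if event in big)
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - coefficient))


def hellinger_full(x, y):
    """Reference version that visits every event of both distributions."""
    events = set(x) | set(y)
    total = sum((math.sqrt(x.get(e, 0.0)) - math.sqrt(y.get(e, 0.0))) ** 2
                for e in events)
    return math.sqrt(total / 2)


# Toy distributions (hypothetical values) that do not share all events.
p = {"the": 0.5, "cat": 0.3, "sat": 0.2}
q = {"the": 0.6, "dog": 0.4}
print(hellinger_sketch(p, q), hellinger_full(p, q))
```

The two functions should agree up to rounding, while the first one only performs as many lookups as there are events in the smaller distribution.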

In [18]:
print_function(jensen_shannon)

In [19]:
print_function(normal_total_variation)

In [20]:
print_function(kullback_leibler)

In [21]:
print_function(bhattacharyya_coefficient)

In [22]:
print_function(bhattacharyya)

In [23]:
print_function(hellinger)

### The options
We will use the following options for training and testing. More information about customizing the options is available [here](https://github.com/LucaCappelletti94/zipf).

In [24]:
options = {}

## Creating the Zipfs
We will now convert all the chapters in the dataset into the respective zipf for each option.

In [25]:
def create_zipfs(paths, factory, test_path):
    """Create and save a zipf under test_path for each path in paths."""
    for data_path in tqdm(paths, unit=' zipf'):
        path = "%s/%s.json"%(test_path, '/'.join(data_path.split('/')[1:]))
        # If the zipf already exists we skip it
        if os.path.exists(path):
            continue
        path_dirs = '/'.join(path.split('/')[:-1])
        zipf = factory.run(data_path, ['txt'])
        if not zipf.is_empty():
            if not os.path.exists(path_dirs):
                os.makedirs(path_dirs)
            zipf.save(path)
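
For intuition, a zipf here can be thought of as a mapping from each word to its relative frequency. The following sketch builds one from a string using only the standard library; the regex tokenization and the name `toy_zipf` are assumptions made for illustration and are not meant to reproduce what `ZipfFromDir` does internally.

```python
import re
from collections import Counter


def toy_zipf(text):
    """Map each lowercase word in text to its relative frequency."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


sample = "the cat and the dog and the bird"
frequencies = toy_zipf(sample)
# Print the words from most to least frequent.
for word, frequency in sorted(frequencies.items(), key=lambda kv: -kv[1]):
    print(word, round(frequency, 3))
```

The resulting frequencies sum to 1, which is the normalization the distances above rely on.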

We define the paths for zipfs and their sources:

In [26]:
training_path = "%s/training"%dataset_name
testing_path = "%s/testing"%dataset_name
zipfs_path = '%s/zipfs'%dataset_name

print("I will print training zipfs from %s,\ntesting zipfs from %s\nand save them in %s"%(b(training_path), b(testing_path), b(zipfs_path)))

I will print training zipfs from authors/training,
testing zipfs from authors/testing
and save them in authors/zipfs


We create a factory for creating the zipfs objects from files with the options defined above.

In [27]:
factory = ZipfFromDir(options=options)
print("Created a factory with options %s"%(factory))

Created a factory with options {
  "remove_stop_words": false,
  "stop_words": "it",
  "minimum_count": 0,
  "chain_min_len": 1,
  "chain_max_len": 1,
  "chaining_character": " ",
  "sort": false
}


First we create the training zipfs:

In [28]:
print("Creating training zipfs in %s"%(b(training_path)))
authors = get_dirs(training_path)
print("Some of the paths I'm converting are:")
pprint(authors[:10])
create_zipfs(authors, factory, zipfs_path)

Creating training zipfs in authors/training
Some of the paths I'm converting are:
['authors/training/dh_lawrence',
 'authors/training/twain',
 'authors/training/wilde']





And then the testing zipfs:

In [29]:
print("Creating testing zipfs in %s"%(b(testing_path)))
books = get_books(testing_path)
print("Some of the paths I'm converting are:")
random.seed(42)
random.shuffle(books)
pprint(books[:10])
create_zipfs(books, factory, zipfs_path)

Creating testing zipfs in authors/testing
Some of the paths I'm converting are:
['authors/testing/dh_lawrence/3423',
 'authors/testing/wilde/2242',
 'authors/testing/wilde/2258',
 'authors/testing/twain/3252',
 'authors/testing/twain/3286',
 'authors/testing/dh_lawrence/3439',
 'authors/testing/twain/2848',
 'authors/testing/wilde/2235',
 'authors/testing/wilde/2250',
 'authors/testing/wilde/2236']






## Creating the Classifier
Now that the zipfs have been created, let's run some tests!

First we create the classifier with the options set above:

In [30]:
classifier = ZipfClassifier(options)
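
One way to picture what such a classifier does: keep one reference distribution per class and assign a test distribution to the class whose reference is closest under the chosen metric. The sketch below illustrates that idea on toy dictionaries. It is an assumption about the general approach, not a description of the `ZipfClassifier` internals, and `classify_sketch`, `total_variation` and the data are hypothetical.

```python
def total_variation(x, y):
    """Half the L1 distance between two dict distributions."""
    events = set(x) | set(y)
    return sum(abs(x.get(e, 0.0) - y.get(e, 0.0)) for e in events) / 2


def classify_sketch(references, sample, metric):
    """Return the class whose reference distribution is closest to sample."""
    return min(references, key=lambda cls: metric(references[cls], sample))


# Hypothetical per-author reference distributions.
references = {
    "author_a": {"the": 0.5, "whale": 0.3, "sea": 0.2},
    "author_b": {"the": 0.4, "garden": 0.4, "tea": 0.2},
}
sample = {"the": 0.6, "sea": 0.3, "ship": 0.1}
print(classify_sketch(references, sample, total_variation))
```

Passing the metric as an argument mirrors how the tests later in this notebook swap one distance for another without touching the rest of the code.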

In [31]:
training_zipfs_path = "%s/training"%zipfs_path
testing_zipfs_path = "%s/testing"%zipfs_path

In [32]:
print("Loading zipfs from %s"%(b(training_zipfs_path)))
loaded = []
for zipf in tqdm(get_zipfs(training_zipfs_path)):
    author = zipf.split('/')[-1].split('.')[0]
    args = zipf, author
    loaded.append(args)
    classifier.add_zipf(*args)

Loading zipfs from authors/zipfs/training





In [33]:
print("Some of the loaded zipfs and its class:")
for path, cls in loaded[:10]:
    print("Path: %s, class: %s"%(b(path), b(cls)))

Some of the loaded zipfs and its class:
Path: authors/zipfs/training/twain.json, class: twain
Path: authors/zipfs/training/dh_lawrence.json, class: dh_lawrence
Path: authors/zipfs/training/wilde.json, class: wilde


In [34]:
print("Loading tests from %s"%(b(testing_zipfs_path)))
test_couples = []
for zipf in tqdm(get_zipfs(testing_zipfs_path)):
    author = zipf.split('/')[-2]
    args = zipf, author
    test_couples.append(args)

Loading tests from authors/zipfs/testing





In [35]:
print("Some of the loaded zipfs and its class:")
random.seed(42)
random.shuffle(test_couples)
for path, cls in test_couples[:10]:
    print("Path: %s, class: %s"%(b(path), b(cls)))

Some of the loaded zipfs and its class:
Path: authors/zipfs/testing/dh_lawrence/3464.json, class: dh_lawrence
Path: authors/zipfs/testing/wilde/2242.json, class: wilde
Path: authors/zipfs/testing/wilde/2229.json, class: wilde
Path: authors/zipfs/testing/twain/the-bequest.json, class: twain
Path: authors/zipfs/testing/twain/my-autobiography.json, class: twain
Path: authors/zipfs/testing/dh_lawrence/3435.json, class: dh_lawrence
Path: authors/zipfs/testing/wilde/2239.json, class: wilde
Path: authors/zipfs/testing/wilde/2318.json, class: wilde
Path: authors/zipfs/testing/wilde/2273.json, class: wilde
Path: authors/zipfs/testing/twain/321.json, class: twain


In [36]:
print(b("Testing"))
results = classifier.test(test_couples, jensen_shannon)

Testing


Done testing 100.000000%

In [37]:
success(results, jensen_shannon)

Success with metric jensen_shannon: 86.18%

| Result | Value |
| --- | --- |
| success | 106.0 |
| failures | 16.0 |
| unclassified | 1.0 |
| mean_delta | 0.0171099 |
| Mistook Wilde for Twain | 9.0 |
| Mistook Dh_lawrence for Twain | 1.0 |
| Mistook Dh_lawrence for Wilde | 2.0 |
| Mistook Wilde for Dh_lawrence | 4.0 |


In [38]:
def metrics_test(classifier, test_couples):
    """Run test on metrics usable on zipfs."""
    metrics = [normal_total_variation, kullback_leibler, bhattacharyya, hellinger, jensen_shannon]
    for metric in tqdm(metrics, unit='metric'):
        results = classifier.test(test_couples, metric)
        success(results, metric)

In [39]:
metrics_test(classifier, test_couples)

Done testing 100.000000%

Success with metric normal_total_variation: 74.8%

| Result | Value |
| --- | --- |
| success | 92.0 |
| failures | 31.0 |
| unclassified | 0.0 |
| mean_delta | 0.0253945 |
| Mistook Dh_lawrence for Wilde | 8.0 |
| Mistook Wilde for Twain | 12.0 |
| Mistook Dh_lawrence for Twain | 6.0 |
| Mistook Wilde for Dh_lawrence | 4.0 |
| Mistook Twain for Wilde | 1.0 |


Done testing 100.000000%

Success with metric kullback_leibler: 52.85%

| Result | Value |
| --- | --- |
| success | 65.0 |
| failures | 58.0 |
| unclassified | 0.0 |
| mean_delta | 0.156089 |
| Mistook Twain for Wilde | 23.0 |
| Mistook Wilde for Dh_lawrence | 6.0 |
| Mistook Dh_lawrence for Wilde | 28.0 |
| Mistook Wilde for Twain | 1.0 |


Done testing 100.000000%

Success with metric bhattacharyya: 86.99%

| Result | Value |
| --- | --- |
| success | 107.0 |
| failures | 16.0 |
| unclassified | 0.0 |
| mean_delta | 0.0655142 |
| Mistook Wilde for Twain | 8.0 |
| Mistook Dh_lawrence for Twain | 1.0 |
| Mistook Wilde for Dh_lawrence | 4.0 |
| Mistook Dh_lawrence for Wilde | 3.0 |


Done testing 100.000000%

Success with metric hellinger: 91.06%

| Result | Value |
| --- | --- |
| success | 112.0 |
| failures | 11.0 |
| unclassified | 0.0 |
| mean_delta | 0.0229206 |
| Mistook Wilde for Twain | 6.0 |
| Mistook Wilde for Dh_lawrence | 4.0 |
| Mistook Dh_lawrence for Wilde | 1.0 |


Done testing 100.000000%

Success with metric jensen_shannon: 86.18%

| Result | Value |
| --- | --- |
| success | 106.0 |
| failures | 16.0 |
| unclassified | 1.0 |
| mean_delta | 0.0171099 |
| Mistook Wilde for Twain | 9.0 |
| Mistook Dh_lawrence for Twain | 1.0 |
| Mistook Dh_lawrence for Wilde | 2.0 |
| Mistook Wilde for Dh_lawrence | 4.0 |



