# SURFsharekit scraped data analysis
This notebook explores some statistics of the data scraped from SURFsharekit. It takes the **newly** scraped data as input (in the `stimuleringsregeling/new` directory).

We will look at the number of documents per content type (HTML, PDF and docs) and provide a breakdown of the type of data on the HTML pages (this is manual work).

## TODOs
1. Look at failed scrapes

Two dimensions along which to classify:
1. Material type (exercise, lecture, source data, syllabus, mix, etc.);
1. Content type (Jupyter notebook, Word document, HTML page, PDF, video, audio, etc.).

**Create a breakdown for both of them, if possible.**

In [None]:
import os
import json
from glob import glob
from itertools import chain, groupby
from urllib.parse import urlparse

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Plot customization

import matplotlib as mpl
mpl.rcParams['font.size'] = 16

**Change the location of the SURFsharekit scrape data to your surfdrive location in the cell below. Be aware that this should be the location of the `new` directory!**

In [None]:
SHAREKIT_OUTPUT_PATH = '~/surfdrive/POL/scraped/stimuleringsregeling/new'

In [None]:
# Some helper functions

def load_json(path):
    with open(path) as stream:
        return json.load(stream)
    
    
def countby(iterable, keyfunc):
    # groupby only groups consecutive items, so `iterable` must be sorted by `keyfunc`.
    return {key: len(list(values)) for key, values in groupby(iterable, keyfunc)}


def identity(x):
    return x
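
One caveat with `countby`: `itertools.groupby` only groups consecutive items with equal keys, which is why the cells in this notebook sort their input before counting. A minimal sketch of that behaviour, with `collections.Counter` as an order-independent alternative (the sample list is made up for illustration):

```python
from collections import Counter
from itertools import groupby


def countby(iterable, keyfunc):
    # Same idea as the helper above: counts runs of consecutive items sharing a key.
    return {key: len(list(values)) for key, values in groupby(iterable, keyfunc)}


sample = ['pdf', 'word', 'pdf']  # made-up example data

# Unsorted input: the second 'pdf' run overwrites the count of the first one.
unsorted_counts = countby(sample, lambda x: x)

# Sorted input: equal items are adjacent, so the counts are correct.
sorted_counts = countby(sorted(sample), lambda x: x)

# Counter does not depend on the input order.
counter_counts = dict(Counter(sample))
```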

Count the number of documents (SURFsharekit pages).

In [None]:
json_files = glob(os.path.expanduser(os.path.join(SHAREKIT_OUTPUT_PATH, '*.json')))
sharekit_documents = [load_json(path) for path in json_files]
len(sharekit_documents)

Count the number of *attachments* in total across all documents.

In [None]:
attachments = list(chain(*[_['documents'] for _ in sharekit_documents]))
len(attachments)
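
The flattening step assumes every scraped page has a `documents` list. A small sketch with made-up records, using `chain.from_iterable` and `dict.get` so that a page without attachments does not raise a `KeyError` (whether such pages occur in the scrape output is an assumption):

```python
from itertools import chain

# Made-up records mimicking the scraped JSON structure used above.
pages = [
    {'title': 'a', 'documents': [{'url': 'https://example.org/1'},
                                 {'url': 'https://example.org/2'}]},
    {'title': 'b', 'documents': []},
    {'title': 'c'},  # hypothetical page without a 'documents' key
]

# Flatten the per-page attachment lists into one list.
flattened = list(chain.from_iterable(page.get('documents', []) for page in pages))
n_attachments = len(flattened)
```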

## Scrape exceptions
Count how often each scrape exception occurred. YouTube links aren't scraped at the moment, which is why we find **437** `'youtube'` 'exceptions'.

In [None]:
def parse_exception_text(text):
    if 'IsYoutubeLink' in text:
        return 'youtube'
    return text

scrape_exceptions = [_['exception'] for _ in attachments if 'exception' in _]
mapped_exceptions = (parse_exception_text(_) for _ in scrape_exceptions)

exception_counts = countby(sorted(mapped_exceptions), identity)
for k, v in exception_counts.items():
    print('{}: {}'.format(k, v))
sum(exception_counts.values())

## Count by content type
Count the number of attachments for which we have a content type, which should be equal to the number of attachments minus the number of exceptions.

In [None]:
HUMANIZED_CONTENT_TYPES = {
    'application/pdf': 'pdf',
    'application/vnd.openxmlformats-officedocument.presentationml.presentation': 'presentation',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'word',
    'text/html': 'HTML - other'
}

def humanize_content_type(content_type):
    if 'text/html' in content_type:
        content_type = 'text/html'
    return HUMANIZED_CONTENT_TYPES[content_type]


content_types = [humanize_content_type(_['content-type']) for _ in attachments if 'content-type' in _]
len(content_types)
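
The expectation that the typed attachments equal the total minus the exceptions can be checked directly. A sketch on made-up attachment records, assuming that each attachment carries exactly one of the `content-type` and `exception` keys:

```python
# Made-up attachment records for illustration.
attachments_sample = [
    {'url': 'https://example.org/a.pdf', 'content-type': 'application/pdf'},
    {'url': 'https://example.org/b', 'content-type': 'text/html; charset=utf-8'},
    {'url': 'https://example.org/c', 'exception': 'IsYoutubeLink'},
]

n_with_type = sum(1 for a in attachments_sample if 'content-type' in a)
n_exceptions = sum(1 for a in attachments_sample if 'exception' in a)

# Attachments with neither key (or with both) would break the equality below.
unaccounted = [a for a in attachments_sample
               if ('content-type' in a) == ('exception' in a)]

assert n_with_type == len(attachments_sample) - n_exceptions
```

If the assertion fails on the real data, `unaccounted` lists the records that need a closer look.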

Dump the content type counts for these 321 attachments.

In [None]:
content_type_counts = countby(sorted((_.lower() for _ in content_types)), identity)
for k, v in content_type_counts.items():
    print('{}: {}'.format(k, v))

For a complete overview, add the YouTube links count to the content type counts as well.

In [None]:
# Use .get so the cell still runs when no YouTube links were encountered.
content_type_counts['html - youtube'] = exception_counts.get('youtube', 0)

Create a Pandas data frame with content types and their counts.

In [None]:
content_type_counts_df = pd.DataFrame({
    'content_type': list(content_type_counts.keys()),
    'count': list(content_type_counts.values())
})
content_type_counts_df

Plot this table as a bar chart and pie chart.

In [None]:
_, axes = plt.subplots(figsize=(16, 7), nrows=1, ncols=2)
sns.barplot(data=content_type_counts_df, x='content_type', y='count', ax=axes[0])
for tick in axes[0].get_xticklabels():
    tick.set_rotation(90)
axes[1].pie(content_type_counts_df['count'], labels=content_type_counts_df['content_type'])
plt.suptitle('#documents per content type (total: {})'.format(content_type_counts_df['count'].sum()));
# plt.savefig('surfsharekit-content-type.pdf', bbox_inches='tight');

## Inspect non-YouTube HTML content
We would like to find out what kind of content is behind the 156 URLs in the 'html - other' category above, which is everything HTML that isn't YouTube.

We manually inspect the top 10 domains referenced by non-YouTube HTML attachments, and start by dumping the number of URLs for each unique domain found in the attachments.

In [None]:
html_attachments = [_['url'] for _ in attachments if 'content-type' in _ and 'text/html' in _['content-type']]
netlocs = [urlparse(_).netloc for _ in html_attachments]

attachment_urls = pd.DataFrame({
    'urls': html_attachments,
    'netloc': netlocs
})
attachment_urls.groupby('netloc').count().sort_values('urls', ascending=False).head(20)
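
The cell above lists the 20 most frequent domains, of which the top 10 are inspected manually. As a pandas-free cross-check, the same ranking can be sketched with `collections.Counter`; the URLs here are made up:

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up URLs for illustration.
urls = [
    'https://vimeo.com/1',
    'https://vimeo.com/2',
    'https://github.com/example/repo',
]

netloc_counts = Counter(urlparse(url).netloc for url in urls)

# List of (netloc, count) pairs, most frequent first, at most 10 entries.
top_domains = netloc_counts.most_common(10)
```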

We manually inspect a number of URLs for each domain that we found in the attachments.

In [None]:
list(attachment_urls.loc[attachment_urls['netloc'] == 'xoteur.12change.eu'].urls)

## Notes on the top 10 non-YouTube HTML domains
1. `xoteur.12change.eu`: Erasmus MC training portal. Case studies, short movies, exercises (multiple choice and free-form). Requires clicking, user input.
1. `lecturenet.uu.nl`: Short movies from UU. Mostly earth sciences, it seems.
1. `vimeo.com`: UvA data science videos.
1. `app.dwo.nl`: exercises from the Digitale Wiskunde Omgeving. Very clicky.
1. `coo.erasmusmc.nl`: another set of virtual case studies from the Erasmus MC.
1. `online.codarts.nl`: Codarts is the applied university of arts in Rotterdam. URLs are different, but because we're not logged in, they all seem to redirect to the same page. Exercises.
1. `github.com`: tool repositories for UvA programming course, and notebooks for Tilburg University data science course.
1. `player.ou.nl`: data science videos from the Open University.
1. `my.questbase.com`: Codarts interactive exercises, with audio.
1. `figshare.com`: source files (GIS data, videos) for UU earth sciences.

We do a quick and dirty classification of URLs for the top 10 domains. Anything else will be listed as 'unclassified'. The definition of content type loses its rigour here, as we're also classifying by material type.

In [None]:
# Quick-and-dirty classification of non-Youtube HTML content

HTML_URL_MAPPING = {
    'xoteur.12change.eu': 'case study',
    'lecturenet.uu.nl': 'video',
    'vimeo.com': 'video',
    'app.dwo.nl': 'exercise',
    'coo.erasmusmc.nl': 'case study',
    'online.codarts.nl': 'exercise',
    'github.com': 'Jupyter notebook',
    'player.ou.nl': 'video',
    'my.questbase.com': 'exercise',
    'figshare.com': 'source data',
    'www.edx.org': 'online course',
    'docs.python.org': 'python docs'
}


def classify_non_html_attachment_url(url):
    # Despite its name, this classifies non-YouTube *HTML* attachment URLs by domain.
    return HTML_URL_MAPPING.get(urlparse(url).netloc, 'unclassified')
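
A sketch of how a domain lookup like this behaves on a few made-up URLs. `urlparse(...).netloc` keeps any `www.` prefix and port, so `www.vimeo.com` would not match a `vimeo.com` key. The `normalize_netloc` helper below is a hypothetical addition, not part of the notebook:

```python
from urllib.parse import urlparse

# Subset of the mapping above, for illustration.
MAPPING = {
    'vimeo.com': 'video',
    'github.com': 'Jupyter notebook',
}


def normalize_netloc(url):
    # Hypothetical helper: take the lowercased host (without port)
    # and drop a leading 'www.'.
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host


def classify(url):
    return MAPPING.get(normalize_netloc(url), 'unclassified')


labels = [classify(u) for u in (
    'https://vimeo.com/123',
    'https://www.vimeo.com/123',
    'https://example.org/page',
)]
```

With this kind of normalization, mapping keys such as `www.edx.org` would have to be stored without the `www.` prefix.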

Create a Pandas data frame with material type and count.

In [None]:
html_material_types = (classify_non_html_attachment_url(_) for _ in list(attachment_urls['urls']))
material_type_counts = countby(sorted(html_material_types), identity)
material_types = pd.DataFrame({
    'count': list(material_type_counts.values()),
    'material': list(material_type_counts.keys())
})
material_types

Plot material type and the associated count in the non-YouTube HTML attachments.

In [None]:
_, axes = plt.subplots(figsize=(16, 7), nrows=1, ncols=2)
sns.barplot(data=material_types, x='material', y='count', ax=axes[0])
for tick in axes[0].get_xticklabels():
    tick.set_rotation(90)
axes[1].pie(material_types['count'], labels=material_types['material'])
plt.suptitle('#non-Youtube HTML material type (estimated) (total: {})'.format(material_types['count'].sum()));
# plt.savefig('non-youtube-html-material-type.pdf', bbox_inches='tight');