# Stack Overflow: basic analysis

![Screenshot](screenshot.png "The Stack Overflow website")

https://stackoverflow.com/ is a wonderful resource for software developers at every level, from beginners to professionals. Users can ask detailed questions about software and software development, and people with more experience in the topic respond. This leads to fruitful discussions and interesting connections between users with different levels of expertise, and also between different areas of software development.

The idea of the basic analysis is to plot general quantities such as the number of posts versus time, additionally split into different categories / tags.
The frequency of tags and the relations between them are of further interest. We also want to plot the network of tags in order to discover interesting connections and links (possibly as a function of time).
The later goal of training a TensorFlow model to predict good questions and responses will be attempted in a separate notebook.
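As a minimal sketch of the tag analysis described above, the snippet below counts how often each tag occurs and how often two tags appear on the same post. The pair counts could later serve as edge weights of the tag network. It assumes that tags are stored as a single string of the form `<python><pandas>`. The helper names `split_tags` and `tag_statistics` and the sample strings are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: tags arrive as one string such as "<python><pandas>".
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_tags(raw):
    """Return the list of tag names contained in a raw tag string."""
    return TAG_PATTERN.findall(raw or "")


def tag_statistics(tag_strings):
    """Count single tags and unordered tag pairs over all posts."""
    frequency = Counter()
    pairs = Counter()
    for raw in tag_strings:
        # Sort so that each pair has one canonical order.
        tags = sorted(set(split_tags(raw)))
        frequency.update(tags)
        pairs.update(combinations(tags, 2))
    return frequency, pairs


# Made-up sample data, not taken from the real dump.
sample = ["<python><pandas>", "<python><numpy><pandas>", "<c++>"]
frequency, pairs = tag_statistics(sample)
print(frequency.most_common(2))
print(pairs.most_common(1))
```

Sorting the tags before building pairs means that `("pandas", "python")` and `("python", "pandas")` are counted as the same edge.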

The data of posts, comments, etc. from the Stack Overflow website, as well as from all other Stack Exchange websites, is openly available at https://archive.org/details/stackexchange under the Creative Commons 3.0 license. We hereby give proper reference to the source of the data.

All of the data is provided as 7z-compressed XML files with a clear and clean structure.
These huge XML files can be unpacked and read or, alternatively, iterated over directly in compressed form. We decided to start with the latter approach in order to save disk space. Unpacking the files first, however, would give us more freedom to parallelize the reading of the data. This might need to be explored at a later point.

The Python library hosted at https://github.com/smartfile/python-libarchive provides an interface to libarchive, which is written in C. Using this library, we can open 7z files and iterate over them directly without unpacking them first.

The technical part of iterating over the compressed files and writing them out is done by a Python module found here: https://github.com/AkagiShigeru/incubator-overflow/blob/master/stack_readin.py. Because a generator is used to iterate over the entries in the file, the memory consumption is very low. The writing to the HDF file is done sequentially: a chunk is written from memory, and then the memory is released.
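The generator-plus-chunks idea can be sketched with the standard library alone. This is not the code from `stack_readin.py`. The names `iter_rows` and `chunked` are hypothetical, and the example assumes that each entry is a `<row>` element carrying its fields as XML attributes. A small in-memory byte string stands in for the real file.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_rows(fileobj):
    """Yield the attribute dict of each <row> element, one at a time."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
        # Clear the element so its attributes do not stay in memory.
        elem.clear()


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Tiny stand-in for a posts file (made-up sample data).
sample = io.BytesIO(
    b"<posts>"
    b'<row Id="1" PostTypeId="1" Score="5" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b'<row Id="3" PostTypeId="1" Score="0" />'
    b"</posts>"
)

chunks = list(chunked(iter_rows(sample), size=2))
for chunk in chunks:
    print(len(chunk), [row["Id"] for row in chunk])
```

In a real pipeline, each chunk could be converted to a `pandas.DataFrame` and appended to the store with `HDFStore.append` before the next chunk is read, so that only one chunk is held in memory at a time.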

For some first insightful plots, we will start by analysing the subset of the first 5 million posts. These were written out into an HDF5 file and can be conveniently read with pandas and its interface to PyTables.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from pyik.mplext import ViolinPlot
from matplotlib import pyplot as plt

We are using the PyIK library (https://github.com/HDembinski/pyik) for some special plots and analyses. We are co-authors of this library, which provides some neat additions to numpy and matplotlib.

In [1]:
# path to the hdf file containing the first 5M posts (posts or comments)
posts_path = "/home/alex/data/stackexchange/overflow/caches/posts_first5M.hdf5"

In [None]:
store = pd.HDFStore(posts_path, "r", complib="blosc", complevel=9)

# the hdf file containing posts was written in the pytables "table" format
# this optionally allows the user to query a subset of the data on disk without ever loading all data into RAM
# smask = store.select_as_coordinates("posts", "Id > %d" % 10000)
# posts = store.select("data", where=smask)