Generating fun Stack Exchange questions using Markov chains



try it out


  • Python 3.5+ (only tested with Python 3.6)
  • 7z

On Debian and similar distributions, install 7z with:

sudo apt-get install p7zip-full


  • git clone with submodules
git clone
cd se-simulator
git submodule init
git submodule update
  • pip install -r requirements.txt
  • create a MySQL database called se-simulator
  • rename to and fill in the database details and create a secret_key
  • run, which creates the database and fetches the list of SE sites
  • run (which should run really quickly)
  • create folders called chains, download and raw (or symlinks to a location with more free disk space)
  • download the .7z files for the sites you want to generate (it's recommended to start with a file smaller than 100 MB)
    • If the .7z file is named differently from the site's current name, rename it to match
  • run
    • It should check the hash, move the file to raw/, unpack it and extract the needed content from the .xml files into new .jsonl files. It also records the file in the db, so it won't be imported again.
  • now the most important step: run
    • this will generate the Markov chains and save them (or reuse the existing ones on the next run)
    • afterwards 100 questions will be added to the db, with corresponding answers, titles and usernames
  • run
    • I haven't found a performant way to get a random question without assigning every question an integer and saving the maximum to count.txt
  • run
    • this starts the Flask server on
    • if I didn't miss an important step, the site should be working fine now.
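
The chain-generation step above can be illustrated with a minimal sketch. This is not the project's code (which builds on markovify); it is a plain-Python, word-level chain of order 1, written only to show the idea. The names `build_chain` and `generate`, the marker strings and the tiny corpus are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical sentinel tokens marking the start and end of a sentence.
BEGIN, END = "___BEGIN__", "___END__"


def build_chain(sentences):
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return dict(chain)


def generate(chain, rng, max_words=30):
    """Walk the chain from BEGIN until END is drawn or max_words is reached."""
    word, output = BEGIN, []
    while len(output) < max_words:
        # Followers that occurred more often appear more often in the list,
        # so a uniform choice reproduces the observed transition frequencies.
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)


# Made-up example titles, standing in for imported site content.
corpus = [
    "how do I sort a list in python",
    "how do I reverse a string in java",
    "why is my list empty in python",
]
chain = build_chain(corpus)
question = generate(chain, random.Random(42))
print(question)
```

Saving the `chain` dictionary to disk after the first run and loading it afterwards would correspond to reusing existing chains instead of rebuilding them.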

other files

  • needed for Flask
  • and peewee ORM
  • manually collected colors of every site with a custom theme
  • extending the great markovify library for my use case
  • reading in the Stack Exchange dump XML files with no more than 40MB RAM usage.
  • everything that creates the content and handles the Markov chains
  • probably not working anymore, checks for newer dump files
  • everything else
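
Reading large XML files with a small, roughly constant memory footprint, as mentioned in the list above, is commonly done with incremental parsing. The following is a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse`, not the project's actual reader. The function name `iter_rows` is hypothetical, and the layout of `<row>` elements carrying `Id` and `Title` attributes is an assumption about the dump format.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            # Detach already-processed children from the root so the
            # in-memory tree does not grow with the size of the file.
            root.clear()


# Small in-memory stand-in for a dump file (assumed structure).
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<posts>"
    b'<row Id="1" Title="First question" />'
    b'<row Id="2" Title="Second question" />'
    b"</posts>"
)
rows = list(iter_rows(sample))
print(rows)
```

`iter_rows` also accepts a file path, so the same generator could be pointed at an unpacked .xml file and its output written line by line to a .jsonl file.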