content discovery... IN 3D

Repo for building Sketchfab recommendations. Collecting data, training algorithms, and serving recommendations on a website will all live here.

This repo will likely not work with Python 2 due to various encoding issues.

Some of the crawling processes use Selenium. For these to work, you must provide a path to your browser driver in config.yml. See here for links to download the driver binary.
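A config.yml might look like the following sketch. The field names here are assumptions based on the settings this README mentions (the driver path and LIKE_LIMIT); check config.yml.example in the repo for the actual schema:

```yaml
# Illustrative config.yml sketch -- field names are assumptions, not the repo's real schema.
driver_path: /usr/local/bin/chromedriver  # path to your Selenium browser driver binary
like_limit: 5                             # LIKE_LIMIT: minimum likes for a model's url to be collected
```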

Collecting data

Use this script to crawl the Sketchfab site and collect data. It currently supports 4 processes, specified by the --type argument:

  • urls - Grab the url of every Sketchfab model with number of likes >= LIKE_LIMIT, as defined in the config.
  • likes - Given the collected model urls, collect the users who have liked those models.
  • features - Given the collected model urls, collect the categories and tags associated with those models.
  • thumbs - Given the collected model urls, collect a 200x200 pixel thumbnail of each model.

Run it like

python config.yml --type urls

I ran into lots of timeout issues when crawling features. To pick back up at a particular row of the urls file, pass --start row_number as an optional argument.

Used to anonymize user_ids in the likes data. Granted, one could probably reverse this, but it serves as a small privacy barrier.

To run, you must provide a secret key for hashing the user_ids:

python unanonymized_likes.csv anonymized_likes.csv "SECRET KEY"
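The core of this step is keyed hashing. The sketch below shows one way it could work; the function name, the uid column name, and the use of HMAC-SHA256 are all illustrative assumptions, not the repo's actual implementation:

```python
import csv
import hashlib
import hmac


def anonymize_user_id(user_id, secret_key):
    """Hash a user_id with a secret key so it cannot be trivially reversed.

    Illustrative only: the real script may use a different hashing scheme.
    """
    return hmac.new(secret_key.encode(), user_id.encode(), hashlib.sha256).hexdigest()


def anonymize_likes(in_path, out_path, secret_key):
    """Read a pipe-separated likes file, hash its uid column (an assumed
    column name), and write the anonymized rows back out."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="|")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="|")
        writer.writeheader()
        for row in reader:
            row["uid"] = anonymize_user_id(row["uid"], secret_key)
            writer.writerow(row)
```

The same user_id always maps to the same hash, so the anonymized data still supports collaborative filtering, while recovering the original ids requires the secret key.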

The data

Model urls, likes, and features are all in the /data directory. They were collected around October 2016.

All data are pipe-separated csv files with headers; read them with pandas read_csv() using the keyword arguments quoting=csv.QUOTE_MINIMAL and escapechar='\\'.
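Loading one of these files might look like the sketch below; the helper name and the example path are assumptions, so substitute whichever file from /data you want:

```python
import csv

import pandas as pd


def load_psv(path):
    """Load one of the pipe-separated data files with the read_csv
    settings described above."""
    return pd.read_csv(
        path,
        sep="|",                    # files are pipe-separated
        quoting=csv.QUOTE_MINIMAL,  # quoting convention used when writing
        escapechar="\\",            # escape character used when writing
    )


# Hypothetical usage -- substitute a real file from /data:
# df = load_psv("data/likes.csv")
```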