GitHub - matpalm/trending: testing out some trending algorithms, mostly written in hadoop pig

This repository has been archived by the owner on Dec 3, 2018. It is now read-only.

kalman_filter
lib
pig
site
tt
.gitignore
README
cheese_tweets.sample.tsv
rc


trying out some simple trending algorithms

notes on http://matpalm.com/blog/tag/e15

TODOS:
 - inverted index from trending terms back to the documents that use them
 - ability to facet; ie trends per forum as well as overall trends
 - include ignore_punc.rb (eg ["can","'","t"] -> ["can't"])
 - only count term once per document (?)

DATA PREP:

# need to sort by time, not id, since that's how we bucket into the timeslots
$ zcat ../tt/tt.posts.tsv.gz | head -n1000 | sort -t$'\t' -k2 -n | ../tt/extract_body.rb | split_into_chunks.rb
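
a rough ruby sketch of why the sort matters for bucketing. this is not the code in split_into_chunks.rb; the record layout ([id, epoch seconds, body]), the one hour slot width and the name bucket_by_timeslot are all assumptions made for illustration.

```ruby
# hypothetical records: [id, epoch_seconds, body]; ids are deliberately not in time order
POSTS = [
  [3, 7300, "more cheese"],
  [1, 100,  "cheese please"],
  [2, 3700, "no cheese"],
  [4, 200,  "cheese again"],
]

SLOT_SECONDS = 3600 # assumed timeslot width

# sort by time, then group each record into the timeslot its timestamp falls in
def bucket_by_timeslot(posts, slot_seconds)
  posts.sort_by { |_id, time, _body| time }
       .group_by { |_id, time, _body| time / slot_seconds }
end

CHUNKS = bucket_by_timeslot(POSTS, SLOT_SECONDS)
CHUNKS.each do |slot, posts|
  puts "slot #{slot}: ids #{posts.map(&:first).inspect}"
end
```

post 4 has the highest id but lands in the first slot, which is why sorting by id would put records in the wrong order for chunking.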

v1)

only consider a token's frequency in the timeslices where the token occurs

to run ruby version
bash> source rc
then see lib/run.sh for the end to end script to build all the data for generating the prj page graphs

to run pig version
cd pig
cat run.sh for info

trending score = fraction over twice sd

v3a)

combination: start considering a token when it is first seen, but from then on,
if the token is not seen in a timeslice, assume a value of zero for that timeslice
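
a minimal ruby sketch of the difference between the two series, assuming per-timeslice counts are held in a hash keyed by slice number. the data and the names v1_series / v3a_series are made up for illustration and are not taken from lib/.

```ruby
# hypothetical per-timeslice counts for one token; slices where it was not seen are absent
SEEN = { 2 => 5, 3 => 1, 6 => 4 }
LAST_SLICE = 7

# v1 style: only the slices where the token occurs
def v1_series(seen)
  seen.sort.map { |_slice, count| count }
end

# v3a style: from the first sighting onwards, unseen slices count as zero
def v3a_series(seen, last_slice)
  first = seen.keys.min
  (first..last_slice).map { |slice| seen.fetch(slice, 0) }
end

V1  = v1_series(SEEN)
V3A = v3a_series(SEEN, LAST_SLICE)
puts "v1:  #{V1.inspect}"
puts "v3a: #{V3A.inspect}"
```

slices 0 and 1 stay out of both series since the token had not been seen yet; the zeros in the second series pull its mean down for tokens that only show up now and then.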

forget about fft cases

consider the trending value BEFORE folding the chunk into the model
(makes a huge difference to 1,2,3,2,2,3,4,20 style cases)

trending score = fraction of sd over the mean
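
a small ruby sketch of the before/after folding difference on the 1,2,3,2,2,3,4,20 case. it assumes the score means "how many sds the new value sits above the mean", ie (value - mean) / sd with a population sd; that reading, and the helper names, are assumptions rather than the code in lib/.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# population standard deviation
def sd(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size)
end

# assumed reading of the score: how many sds the new value sits above the mean
# (a history with zero variance would need special handling; ignored here)
def score(history, value)
  (value - mean(history)) / sd(history)
end

HISTORY = [1, 2, 3, 2, 2, 3, 4]
LATEST  = 20

# score the new value against the model as it stood before this chunk
BEFORE_FOLD = score(HISTORY, LATEST)
# score it after the new value has already been folded into the model
AFTER_FOLD  = score(HISTORY + [LATEST], LATEST)

puts format("before folding: %.2f", BEFORE_FOLD)
puts format("after folding:  %.2f", AFTER_FOLD)
```

folding the 20 in first inflates both the mean and the sd, so the spike ends up scoring far lower than when it is judged against the earlier values only.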

if a token appears n times in a single document it counts for n in the chunk;
still deciding whether to change this to counting for 1 in the chunk, since a post like 'shut up shut up shut up shut up shut up shut up' causes grief.
perhaps tf/idf would be better actually...
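
a ruby sketch of the two counting options, reading "counting for 1" as once per document (as in the TODOS). the documents are made up and assumed to be already tokenised; raw_counts and once_per_doc_counts are hypothetical names.

```ruby
# hypothetical chunk of already tokenised documents
DOCS = [
  %w[shut up shut up shut up shut up shut up shut up],
  %w[please shut the door],
]

# tally tokens into a hash of token => count
def count_tokens(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

# every occurrence counts: a token seen n times in one document adds n
def raw_counts(docs)
  count_tokens(docs.flatten)
end

# each document contributes at most 1 per token
def once_per_doc_counts(docs)
  count_tokens(docs.flat_map(&:uniq))
end

RAW  = raw_counts(DOCS)
ONCE = once_per_doc_counts(DOCS)
puts "raw:          shut=#{RAW['shut']} up=#{RAW['up']}"
puts "once per doc: shut=#{ONCE['shut']} up=#{ONCE['up']}"
```

with raw counts the single repetitive post dominates the chunk; counting once per document turns the count into the number of documents that use the token.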