Data Science & Engineering

Data Languages

  • R vs Python – Round 2 - The Swarm Lab @ NJIT
  • The comments at the end of this analysis of the academic popularity of R, S
  • A Python Compiler for Big Data
  • lynnlangit: A programmer’s guide to big data: 12 tools to know
  • Download Revolution R - Free for Academia - Revolution Analytics
  • Neeeaato RT @golan: Processing R on OSX:…
  • "Sort of astounding how much dplyr/tidyr/magrittr have changed my R code. Can hardly read my old stuff anymore. #nerdtwitter @hadleywickham" - Ben Casselman

    Experiments & Research

  • Seven A/B testing mistakes you need to stop making in 2013 « I love split t
  • » How do you prioritize research?
  • 20 lines of code that will beat A/B testing every time - Steve Hanov's Prog
  • Optimization at the Obama campaign: a/b testing
  • The dark side of data - Strata
  • Random Observations: A/B testing scale cheat sheet
  • Bill Gates on the Importance of Measurement -

    Open Data

  • Open data is not a panacea « mathbabe

    Data Acquisition, Sources

  • Where can I find large datasets open to the public? - Quora
  • Web scraping 101 with Python:
  • Research Blog: Ngram Viewer 2.0
  • 9 uses for cURL worth knowing | httpkit | Tools for hacking on HTTP
  • Cold Hard Facts - The Million Song Dataset Challenge: Part I
  • Easy Stats
  • bwtaylor: You want data? @hmason thanks for compiling
  • jluismarin: It seems Santa Claus brought us the European Commission #Openda
  • lynnlangit: Got data? Make a quick list of public data sets to use as you l
  • » Bitly Social Data APIs
  • Let's Scrape the Web with Python 3 -
  • Greg Reda | Web Scraping 101 with Python
  • Some friends wanted to learn how to scrape data, so I wrote a very basic in

    Data Cleaning/Munging, Feature Engineering

  • google-refine - Google Refine, a power tool for working with messy data (fo
  • Data Wrangler

    Data Viz

  • Graph Design for the Eye and Mind, Stephen Kosslyn | Civil Statistician
  • peschkaj: "The top 20 data visualisation tools" - reco
  • HlthAnalysis: Explaining Bayesian Problems Using Visualizations ( Video) ht
  • Gephi, an open source graph visualization and manipulation software
  • Kim Rees: Data Visualization Primers
  • Yoda: Big Data Hype Small Data Hype Visualization Hype – “Technology is not
  • Responsive Information Design for Infographics with Dynamic Data | Visual.l
  • tableau: Why you should seriously consider dropping in to the Iron Viz Cham
  • mikedewar: Data Vis Jobs Google Group. An important resource for those of u
  • LlewellynFalco: If you are thinking about data visualization, check out thi
  • A Carefully Selected List of Recommended Tools on
  • m.e.driscoll: data utopian • data visualization is a halfway house
  • FastCompany: The best Hurricane #Sandy visualizations:
  • drewconway: This is why density plots are tricky
  • peschkaj: Zeus's affairs: #dataviz
  • hmason: cute graphing hack:
  • DataKind: All the smartest people in viz talking about the state of data: g
  • Stack Overflow Data Visualization Contest « Blog – Stack Exchange
  • Sparky
  • albertocairo: While writing a post about radar graphs, I came across this p
  • CourseWiki - CS 448B
  • brainpicker: A visual history of Arabian Nights, some of history's greatest
  • hmason: I love these ideas for visualizing uncertainty in plots
  • idigdata: Great collection of data visualization libraries and tools http:/
  • noahi: FYI: There’s a visualization jobs list, founded by @arnicas, at http
  • Beautiful Weather Graphs and Maps - WeatherSpark
  • How to Make a US County Thematic Map Using Free Tools
  • How to Make a Heatmap – a Quick and Easy Solution
  • xkcd: Heatmap
  • The Data Behind My Ideal Bookshelf | Fred Benenson's Blog
  • Art-Spire, Source d'inspiration artistique / 32 amazing data visualization
  • Word Tree
  • Blowing the whistle at bubble charts via @prismatic
  • Feltron 2012 Annual Report
  • Periscopic: Do good with data

    D3 and related

  • dimple - A simple charting API for d3 data visualisations
  • Coming from a matplotlib disliker and author of a Python->D3 tool: @jakevdp
  • TrestleJeff: My new Shiny App with custom D3 network output for gene networ
  • Interactive Data Visualization for the Web - OFPS - O'Reilly Media
  • 3.0 · mbostock/d3 Wiki · GitHub
  • D3, Conceptually - Lesson 1
  • D3.js - Data-Driven Documents
  • D3 Intro
  • New Bamboo - The leading Ruby on Rails developers, London, UK
  • D3.js 101
  • Building a lightweight, flexible D3.js dashboard for analytics
  • D3.js Gallery Data - temporarily in view mode - Google Docs
  • An alternative #d3js documentation… by @KDKTN
  • Let's Make a Bar Chart with Lyra - Jim Vallandingham
  • Use Bokeh and Seaborn instead of Vincent


  • RobbyMeals: new post visualizing #Baltimore with #rstats and #ggplot2: http
  • chlalanne: Nice tutorial on #ggplot2, #rstats
  • hadleywickham: fast histograms for big data with ggplot2 and rcpp: http://t
  • Using cairographics with ggsave() | (R news & tutorials)
  • johnmyleswhite: The ggthemes package for #rstats looks very promising: http
  • Clegg vs Pleb: An XKCD-esque chart « Drunks&Lampposts


  • TableauDataNerd: Great #Tableau performance tips from @Odbin that complemen
  • sqlbelle: 25 amazing Interactive graphics created with tableau | Design New
  • DGM885: Hey data geeks and Tableau vizzers. This looks both fun and intere
  • Tableau CEO: We're the Google of data visualization | Data visualization -
  • More @Tableau data connectors...

    Developer Help

  • Installing Hadoop on OSX Lion (10.7) « Denny Lee
  • Love Your Terminal - Programming and Productivity

    Hiring/Firing

  • Big Data News of the Week: Beautiful $300,000 Minds - Forbes
  • » Interview Questions for Data Scientists
  • "Data scientists: 'The hard part is data collection, cleaning, and problem formulation.' D.S. giving an interview: 'Tell me about math.'" - @tdhopper

    Hiring at Dropbox

    • Give a data set
    • Any two hours you want, whatever tools you want
    • Present at the end of it.
    • The key to making money in life: people will pay you to do things they find tedious, boring, and un-fun. Do them, and that's your career.

    Stats and Metrics

  • A Statistical Learning/Pattern Recognition Glossary
  • johnmyleswhite: Stumbled upon this nice looking book on statistical distrib
  • Carl Morris Symposium on Large-Scale Data Inference (2/3) | Civil Statistic
  • Carl Morris Symposium on Large-Scale Data Inference (1/3) | Civil Statistic
  • Carl Morris Symposium on Large-Scale Data Inference (3/3) | Civil Statistic
  • StatsModels: Statistics in Python — statsmodels documentation
  • » Cumulative Distribution Function (CDF) – Analyzing the Roll of Dice with
  • One Big Fluke › Why cohort analysis?
  • Not just one statistics interview...John McGready is the Jon Stewart of sta
  • Hard analysis, soft analysis — The Endeavour
  • Statistics: The Average | Statistics | Khan Academy
  • StatFact: RT @ProbFact: Huge probability distribution chart
  • Chart of distribution relationships
  • StatFact: On Chomsky and the Two Cultures of Statistical Learning http://t.
  • StatFact: Four assumptions of multiple regression that researchers should a
  • StatFact: New names for statistical methods
  • sayhitosean: The (near) Future of Data Analysis - A Review
  • CompSciFact: Summary of some of the math used in computer science http://t.
  • "Remember, nonlinear interactions and non-Gaussian distributions are the default."

    Causal Inference

    Time Series


  • StatFact: Approximate Bayesian Computation (ABC)

    Unsupervised Learning, Data Mining, Association Rules, Basket Analysis

    ML / AI

  • The Curse of Dimensionality in Classification
  • Facebook’s Quest to Build an Artificial Brain Depends on This Guy | Enterpr
  • What I most wish I'd known as an undergrad: Go to public talks! : MachineLe
  • AMA: Yann LeCun : MachineLearning
  • "Stranger in a Strange Land" - @PaulMineiro on possible lessons #machine le
  • Terrific! Caffe, Berkeley's super fast #convnet library, just got BSD licen
  • peteskomoroch: NYU Large Scale Machine Learning Class
  • List of machine learning algorithms - Wikipedia, the free encyclopedia
  • Weka (machine learning) - Wikipedia, the free encyclopedia
  • Weka 3 - Data Mining with Open Source Machine Learning Software in Java
  • mitultiwari: A blog summarizing valuable practical machine learning tricks:
  • Machine Learning: Neural Network vs Support Vector Machine - Stack Overflow
  • S. M. Ali Eslami / Patterns for Research in Machine Learning
  • Scalable mean-shift clustering in a few lines of python | The Sociograph
  • Euclidean distance - Wikipedia, the free encyclopedia
  • Cosine similarity - Wikipedia, the free encyclopedia
  • Upgrading from Beta-Binomial to Logistic Regression « LingPipe Blog
  • Machine Learning — Introduction | Math ∩ Programming
  • StatFact: Compare machine learning algorithms at
  • briankmcdonald: Good read: Mike Milligan's Data Mining talk and Links: http
  • SQLRockstar: RT @MarkTabNet: Book: Journeys to Data Mining - Experiences fr
  • Kaggle President Jeremy Howard: Amateurs beat specialists in data-predictio
  • Probability Theory — A Primer | Math ∩ Programming
  • Peekaboo: Machine Learning Cheat Sheet (for scikit-learn)
  • List of Machine Learning and Data Science Resources - Part 2 | Conductrics
  • Description - The Marinexplore and Cornell University Whale Detection Chall
  • "ML tip: train a model to distinguish between your training set & unlabeled data. If it works, your training data may be incomplete!" - Jake Vanderplas
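
The tip quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the quote does not say which model to use, so a small logistic regression fitted by stochastic gradient descent is assumed here, and the function name `domain_classifier_accuracy`, its parameters, and the synthetic Gaussian data are all hypothetical. Rows from the training set are labelled 0, rows from the unlabeled set are labelled 1, and the classifier is scored on a held-out half.

```python
import math
import random


def _sigmoid(z):
    # Clamp the score so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def domain_classifier_accuracy(train_rows, unlabeled_rows, epochs=200, lr=0.1, seed=0):
    """Held-out accuracy of a logistic regression that guesses which set a row came from.

    Hypothetical helper: the model choice and defaults are assumptions for illustration.
    """
    rng = random.Random(seed)
    # Label 0 = row from the training set, label 1 = row from the unlabeled set.
    data = [(list(r), 0) for r in train_rows] + [(list(r), 1) for r in unlabeled_rows]
    rng.shuffle(data)
    half = len(data) // 2
    fit_part, held_out = data[:half], data[half:]

    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in fit_part:
            p = _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the score
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err

    # Count held-out rows whose predicted set matches the set they came from.
    correct = sum(
        (_sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5) == (y == 1)
        for x, y in held_out
    )
    return correct / len(held_out)


if __name__ == "__main__":
    rng = random.Random(42)
    train = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    similar = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(200)]
    print("same distribution:", domain_classifier_accuracy(train, similar))
    print("shifted distribution:", domain_classifier_accuracy(train, shifted))
```

An accuracy close to 0.5 suggests the model cannot tell the two sets apart. An accuracy well above that suggests the unlabeled data covers regions the training set does not, which is the warning sign the tip describes.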


    Gradient Descent

    Dimensionality Reduction


    Markov Chains, MCMC

    Strong AI

    Recommendation Systems

  • Matrix Factorization Techniques For Recommender Systems |

    Agent Systems, Networks, Adversarial ML, GANs, NP-Hard

    Reinforcement Learning


  • kaggle: Medley: new R package for blending regression models, developed by
  • Free SignalR book: #mvp13 #mvpbuzz
  • eddelbuettel: A first and simple Boost example with Rcpp at the Rcpp Galler
  • Visualizing Principal Components | (R news & tutorials)

    Python

  • slendrmeans: ML for Hackers up to chapter 9, Pythonized:

    Jupyter Notebook

    Scipy and Numpy




  • Latent Semantic Analysis (LSA) Tutorial
  • Foundations of Statistical Natural Language Processing
  • peteskomoroch: English Letter Frequency Counts: Mayzner Revisited http://t.
  • Text Classification and Feature Hashing: Sparse Matrix-Vector Multiplicatio
  • Second Try: Sentiment Analysis in Python : Andy Bromberg
  • List of 25 Natural Language Processing APIs | Mashape Blog
  • Text Processing in R

    Image Processing

  • hmason: An algorithmic exploration of finding interesting parts of an image
  • Image Processing with scikit-image

    Social Media, Graph Algorithms

  • iSchoolSU: .@MSFTResearch is Hosting a Workshop on Analyzing Social Media D

    Deep Learning / Neural Networks

  • drewconway: Deep learning snag front page of NY Times, with nice quotations
  • What Hinton’s Google Move Says About the Future of Machine Learning « Some
  • Webcast: Deep Learning - The Biggest Data Science Breakthrough of the Decad

    LSTM

    Capsule Networks

    Network Optimization, Linear Algebra, NP-hard

  • Study Hacks » Blog Archive » Mastering Linear Algebra in 10 Days: Astoundin

    ML Tools, Hyperparameters, Model Stacking/Averaging


  • Introduction to Information Retrieval
  • Insight Data Science Fellows Program | eScience Institute
  • » Getting Started with Data Science
  • strataconf: #Strataconf 2013 Tutorials are up Core #d
  • CSS 490 / 590 - Introduction to Machine Learning
  • Two New Phd Tracks in Big Data | eScience Institute
  • Data Science Summit | Greenplum
  • The Data Science Venn Diagram « Zero Intelligence Agents
  • rossdawson: The most influential data scientists on Twitter
  • Free Datascience books
  • Why becoming a data scientist is NOT actually easier than you think - josep
  • Free Big Data Education: A Data Science Perspective - Daniel D. Gutierrez |

    Funny

    Data Adoption and Data-Friendly Organizations

    "Not everything that can be counted counts, and not everything that counts can be counted." - Einstein

  • Dan McKinley :: Whom the Gods Would Destroy, They First Give Real-time Anal
  • datachick: How to Organize a Data Science Team
  • How To Hire A Data Scientist « Bright Insights
  • CIOonline: Data Scientist Role Is Clear, Even If Job Description Isn't #dat
  • BigDataClub: Top CEOs Share how Big Data Is Transforming Our Health, Wealth
  • sqlbelle: Top story: Organizational Imperatives in the Era of Big Data - Ja
  • Developing Data Products
  • "So you're telling me that the most highly paid people in an organisation can only process information as complex as a traffic light?" - @ballantine70

    Promising #bigdata decency heuristic from @timoreilly : can worker see the data (and therefore get agency from it)? #SparkSummit

    Data Stacks, Data Engineering

    Do a blog post on the data flow paper

  • "Setting up an Apache Aurora/Mesos Cluster with Vagrant" blog post by @Nutt
  • Impala performance: Now faster on @ParquetFormat than an analytic DBMS on i
  • A Guide to Python Frameworks for Hadoop | Apache Hadoop for the Enterprise
  • peteskomoroch: Latest NoSQL LinkedIn Skills Index shows Apache Accumulo has
  • MongoDB: MongoDB: Architectural Best Practices from @softlayer
  • NoSQL: The Love Child of Google, Amazon and ... Lotus Notes | Wired Enterpr
  • daniel_jacobson: Netflix open sources dynamic query goodness for Amazon clo
  • sqlbelle: Big Data vs Data Warehousing
  • Big Data Platforms
  • Scalable Datastores
  • datachick: Microsoft's PolyBase mashes up SQL Server and Hadoop | ZDNet | @
  • The NoSQL Partition Tolerance Myth :: Hacking, Distributed

    Probabilistic Data Structures

  • Why Bloom filters work the way they do | DDI

    MapReduce, Hadoop, Hive

  • Why the days are numbered for Hadoop as we know it — Cloud Computing News
  • Cloudant Labs on Foundational MapReduce Literature | Cloudant
  • Hadoop Needs Better Bridges to Fulfill the Big Data Promise
  • peschkaj: "Facebook open sources Corona — a better way to do webscale Hadoo

    Impala, Redshift, Distributed SQL, Dremel, BigQuery

  • Google's Dremel Makes Big Data Look Small | Wired Enterprise |
  • rickysaltzer/impala-in-action · GitHub
  • Google's BigQuery Gets Big Dashboards and Expanded Multiple Queries
  • Redshift: PostgreSQL-like in the cloud (benchmark) - RarestBlog

    Spanner

  • Exclusive: Inside Google Spanner, the Largest Single Database on Earth | Wi
  • High Scalability - Google Spanner's Most Surprising Reve

    Kafka

    Spark / BDAS

    TensorFlow, Tensor Algebra

    How-I-Did-It, Industries

  • How and why LinkedIn is becoming an engineering powerhouse
  • Lucky Oyster - Thought Pearls — Data Mining the Web: $100 Worth of Priceles
  • ogrisel: How @evernote uses scikit-learn to automatically tell recipes apar
  • Using deep learning to listen for whales — Daniel Nouri's Blog
  • RedMonk’s analytical foundations, part 1: 2003–2005 – The Story of Data
  • Phil_Factor: Setting up a Data Science Laboratory > (b
  • Meet the Obama campaign's $250 million fundraising platform
  • Romney campaign got its IT from Best Buy, Staples, and friends | Ars Techni
  • Measure Anything, Measure Everything « Code as Craft
  • Amazon Web Services Blog: AWS in Action - Behind the Scenes of a Presidenti
  • BigData_: What were the biggest big data stories in 2012? @derrickharris pi
  • How I made $500k with machine learning and HFT (high frequency trading) | J
  • Team ‘.’ takes 3rd in the Merck Challenge | no free hunch
  • Analyzing Twitter Data with Hadoop | Apache Hadoop for the Enterprise | Clo
  • digg: These are the geeks who won Obama four more years:
  • How Obama’s data scientists built a volunteer army on Facebook — Data | Gig
  • A Bayesian approach to Observing Dark Worlds | no free hunch
  • 1st Place: Observing Dark Worlds | no free hunch
  • drelu: Amex is using Mahout/Hadoop to model customer behavior to prevent fr
  • DGM885: Interesting interview w/ Robert Kosara at Data Stories. His resear