# 1. Introduction

## What is data mining?
 * Semi-automatic procedures to find general and useful patterns in large data sets.

## Applications
 * **Approximate retrieval**: Finding similar elements (similar songs, image search, plagiarism detection, copyright protection, etc.) in giant datasets
 * **Supervised learning**, such as large-scale classification (of user behavior, of images, of text, etc.) and regression
 * **Unsupervised learning**, such as large-scale clustering (search for groups of similar users, images, songs, articles, etc.) and dimension reduction
 * **Recommender systems**: bandit algorithms ($\epsilon$-greedy, UCB1, LinUCB, Hybrid LinUCB, etc.) and their applications in fields such as news article recommendation and advertising
 * Others (monitoring transients in astronomy, spam filtering, fraud detection, machine translation, six degrees of ~~Kevin Bacon~~ separation, etc.)
 
 
## Scale
 * Example: 10-100 TB of data per sky survey (astronomy)
 * Archive sizes measured in **petabytes**
 * Real-time data flows (e.g. computing trends in social networks)
 * Data sources
     - science
     - commercial/civil/engineering
     - security/intelligence/defense

## Technical aspects
 * Want to keep data in main memory as much as possible (faster)
 * If the data don't fit in main memory, we have to access them in a streaming fashion. Random access would be much too expensive, so we have to adapt our algorithms to learn from streaming data.
 * Want *real-time analytics*
 * Want real-time synthesis
 * Want to leverage large-scale parallelism (across entire data centers)
 * Data quality is often poor: missing elements, missing elements represented as seemingly present values (null vs. "" vs. 0 vs. "\0" vs. undefined, etc.), inconsistent schemas, etc.
 * Need to respect users' privacy (control direct access to data)
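
The missing-value problem mentioned above can be illustrated with a small sketch. The set of placeholder strings and the function name `normalize_missing` are assumptions made for this example, not part of the course material; which values actually mean "missing" depends on the data set (in particular, `0` is kept here because it is often a legitimate value).

```python
from typing import Any, Optional

# Assumed placeholder strings that stand for "missing"; adjust per data set.
MISSING_MARKERS = {"", "\0", "null", "undefined", "n/a"}


def normalize_missing(value: Any) -> Optional[Any]:
    """Map the assumed 'missing' encodings to None; leave other values alone."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value


record = {"age": "", "name": "Ada", "city": "NULL", "visits": 0}
cleaned = {field: normalize_missing(value) for field, value in record.items()}
```

Mapping every encoding to a single sentinel (`None` here) early in the pipeline means later stages only have to handle one representation of "missing".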

## Not covered
 * Systems issues (databases, data center management, etc.)
 * Specialized data structures
 * Domain-specific algorithms
     - see **Information retrieval** course for more text-specific elements

## MapReduce
 * Works well with commodity hardware in data centers (DCs)
 * Failure-tolerant (redundancy over DC)
 * Works with distributed file systems (e.g. GFS, HDFS), which are optimized for durability, frequent reads and appends, but rare updates
 * `map(key, value)` and `reduce(key, values)` (bread and butter; other operations exist); the default shuffler does a lot of the grunt work!
 * **Partitions** the input data, **schedules** program execution across a set of machines, handles machine **failures**, and manages inter-machine **communication**
 * A job's output is often another job's input; many tools support multi-stage pipelines natively (Cascading, Scalding, Spark, MS DryadLINQ, etc.)
 * Stragglers (the last few tasks still running, e.g. slow reducers) $\implies$ spawn duplicate copies of the slow tasks and take the result of whichever copy finishes first
 * Hadoop is the most common MapReduce implementation; it relies heavily on disk access $\implies$ slow; Spark can offer large speedups by keeping intermediate data in memory and relying less on disk access
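
To make the `map(key, value)` / `reduce(key, values)` contract concrete, here is a minimal single-process sketch of a word count. The helper `run_mapreduce` is a hypothetical stand-in for the framework (it only imitates the map, shuffle, and reduce phases in memory); it is not the API of Hadoop or any other real implementation, and it ignores partitioning, scheduling, and failure handling.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """map(key, value): emit (word, 1) for every word in one document."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """reduce(key, values): sum all the counts emitted for one word."""
    yield key, sum(values)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Iterator[Tuple[str, int]]],
) -> Dict[str, int]:
    """Hypothetical in-memory stand-in for the framework."""
    # Map phase + shuffle: group every intermediate value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: one reducer call per distinct intermediate key.
    results: Dict[str, int] = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

The user only writes `map_fn` and `reduce_fn`; the grouping step in the middle is the part that the shuffler handles in a real deployment.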

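Trick to compute the variance in one pass: use the formula based on expectations, $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, and estimate each expectation with a running average ($\hat{\mathbb{E}}[X] = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\mathbb{E}}[X^2] = \frac{1}{n}\sum_{i=1}^{n} x_i^2$), so only the count and two running sums need to be kept.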
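
A minimal sketch of this one-pass computation, assuming the data arrive as an iterable of floats (the function name `one_pass_variance` is made up for this example). It returns the population variance, i.e. it divides by $n$, matching the empirical expectations in the formula.

```python
from typing import Iterable


def one_pass_variance(stream: Iterable[float]) -> float:
    """Population variance from a single pass, via E[X^2] - E[X]^2."""
    n = 0
    total = 0.0      # running sum of x
    total_sq = 0.0   # running sum of x^2
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("variance of an empty stream is undefined")
    mean = total / n
    return total_sq / n - mean * mean


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance = one_pass_variance(iter(values))  # 232/8 - 5^2 = 4.0
```

The three accumulators are plain sums, so partial results from different machines can simply be added together, which fits the MapReduce setting. One caveat: when the mean is large relative to the spread, the subtraction can lose precision in floating point; Welford's online algorithm is a common, more stable alternative.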

GPGPUs can also offer massive speed-ups when used right. They are not covered in this course, but are very widely used for algorithms requiring heavy number-crunching (many vector/matrix operations).