### Word count representation for measuring similarity
- Bag of words model
 - Ignore order of words
 - Count # of instances of each word in vocabulary
 - Example
   - "Carlos calls the sport futbol. Emily calls the sport soccer."
   - `1222111`
   - `Carlos / the / calls / sport / futbol / soccer / Emily`
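
The count vector above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the `bag_of_words` helper and the regex-based tokenizer are assumptions made for illustration, and the vocabulary order is copied from the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    # Hypothetical tokenizer: keep runs of letters, drop punctuation.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(words)

doc = "Carlos calls the sport futbol. Emily calls the sport soccer."
counts = bag_of_words(doc)

# Vocabulary order taken from the example above.
vocabulary = ["Carlos", "the", "calls", "sport", "futbol", "soccer", "Emily"]
vector = [counts[w] for w in vocabulary]
print(vector)  # [1, 2, 2, 2, 1, 1, 1]
```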
 
- Measuring similarity
 - dot product
   - `15300100`
   - `32001000`
   - similarity : $1 \cdot 3 + 5 \cdot 2 = 13$
   
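- Issue with word counts -> document length
 - Longer documents have larger counts, so their dot products are larger even when the content is the same
- Solution -> normalize
 - Divide each vector by its **norm**
   - `15300100` -> $\sqrt{1^2 + 5^2 + 3^2 + 1^2} = 6$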
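
The dot product and the effect of normalization can be checked numerically. The sketch below assumes that "normalize" means dividing each vector by its Euclidean norm, which makes the similarity independent of document length; the function names are illustrative only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length count vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(u, u))

def normalized_similarity(u, v):
    """Dot product of the two vectors after dividing each by its norm."""
    return dot(u, v) / (norm(u) * norm(v))

a = [1, 5, 3, 0, 0, 1, 0, 0]
b = [3, 2, 0, 0, 1, 0, 0, 0]

print(dot(a, b))  # 13
print(norm(a))    # 6.0

# Doubling both documents (same text repeated twice) quadruples the raw score...
a_twice = [2 * x for x in a]
b_twice = [2 * x for x in b]
print(dot(a_twice, b_twice))  # 52

# ...but leaves the normalized score unchanged (up to floating-point rounding).
print(normalized_similarity(a, b), normalized_similarity(a_twice, b_twice))
```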

### Prioritizing important words with tf-idf
- Issues with word counts -> Rare words
 - Common words in doc: "the", "player", "field", "goal"
 - Dominate rare words like: "futbol", "Messi"

- Document frequency
 - What characterizes a *rare word?*
   - Appears *infrequently* in the corpus
 - Emphasize words appearing in *few docs*
   - Equivalently, discount word $w$ based on *# of docs containing $w$ in corpus*
 
- Important words
 - Do we want only rare words to dominate?
 - What characterizes an *important word*?
   - Appears frequently in document ( *common locally* )
   - Appears rarely in corpus ( *rare globally* )
 - Trade off between **local frequency** and **global rarity**

### Calculating tf-idf vectors
- TF-IDF ( Term frequency - inverse document frequency )
 - Term frequency
   - same as word count
 - Inverse document frequency
   - $\log{ \frac{\text{# docs}}{1 + \text{# docs using word}} }$
   
- IDF
 - If the word is in many docs -> $\log{ \frac{\text{large #}}{1 + \text{large #}} } \approx \log{1} = 0$
 - If the word is rare -> $\log{ \frac{\text{large #}}{1 + \text{small #}} } \rightarrow \text{large #}$

- example
 - Term frequency
   - `the / Messi`
   - `1000 / 5`
 - Inverse document frequency (log base 2)
   - "the" -> $\log_2{\frac{64}{1+63}} = \log_2{1} = 0$
   - "Messi" -> $\log_2{\frac{64}{1+3}} = \log_2{16} = 4$
   - `the / Messi`
   - `0 / 4`
 - TF-IDF
   - elementwise multiply: `0 / 20`
   - weight of "the" is going to be small
   - weight of "Messi" is going to be large
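
The worked example can be reproduced in code. This is a sketch under stated assumptions: the logarithm is base 2, the corpus has 64 documents, "the" appears in 63 of them, and "Messi" in 3 (a count chosen so that its idf is $\log_2 16 = 4$, as in the example). The `idf` helper is hypothetical.

```python
import math

def idf(num_docs, docs_with_word):
    """Inverse document frequency with the +1 in the denominator (log base 2 assumed)."""
    return math.log2(num_docs / (1 + docs_with_word))

num_docs = 64
term_freq = {"the": 1000, "Messi": 5}  # word counts in the document
doc_freq = {"the": 63, "Messi": 3}     # assumed number of docs containing each word

# tf-idf is the elementwise product of term frequency and idf.
tf_idf = {w: term_freq[w] * idf(num_docs, doc_freq[w]) for w in term_freq}
print(tf_idf)  # {'the': 0.0, 'Messi': 20.0}
```

Although "the" occurs 200 times more often than "Messi" in the document, its tf-idf weight is zero because it appears in almost every document of the corpus.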

### Retrieving similar documents
- Nearest neighbor search
 - Query article
 - Corpus
 - **Specify**: Distance metric
 - **Output**: Set of most similar articles
 
- 1 - Nearest neighbor
 - Input: Query article `A`
 - Output: **Most** similar article `Most`
 - Algorithm:
   - Search over each article `B` in corpus
     - Compute **s = similarity(`A`, `B`)** (then likewise for `C`, `D`, ...)
     - If **s > Best_s**, record `Most` = `B` and set **Best_s = s**
   - Return `Most`
   
- k - Nearest neighbor
 - Input: Query article `A`
 - Output: **List of k** most similar articles
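
Both searches can be sketched as a linear scan over the corpus. The code below is an illustration, not a prescribed implementation: it assumes documents are stored as equal-length vectors in a dict keyed by article id, and it uses cosine similarity as the (swappable) similarity function. The tiny corpus is made up for the example.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def one_nearest_neighbor(query, corpus, similarity=cosine_similarity):
    """Return the id of the article most similar to the query."""
    most, best_s = None, -math.inf
    for doc_id, vector in corpus.items():
        s = similarity(query, vector)
        if s > best_s:
            most, best_s = doc_id, s
    return most

def k_nearest_neighbors(query, corpus, k, similarity=cosine_similarity):
    """Return the ids of the k most similar articles, most similar first."""
    return heapq.nlargest(
        k, corpus, key=lambda doc_id: similarity(query, corpus[doc_id])
    )

# Hypothetical corpus of count vectors.
corpus = {"B": [0, 0, 4, 1], "C": [1, 5, 3, 0], "D": [2, 9, 7, 0]}
query = [1, 5, 3, 1]

print(one_nearest_neighbor(query, corpus))    # C
print(k_nearest_neighbors(query, corpus, 2))  # ['C', 'D']
```

A linear scan costs one similarity computation per article for every query, which is fine for small corpora but is the main thing to revisit at scale.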

### Clustering documents overview
- Structure documents by topic
 - Discover groups (clusters) of related articles: Sports, World news, ...
- What if some of the labels are known?
 - Training set of labeled docs
- Multiclass classification problem
 - supervised learning
 
### Clustering documents: An unsupervised learning task
- Clustering
 - No labels provided
 - Want to uncover cluster structure
 - **Input**: docs as vectors
 - **Output**: cluster labels
 - An **unsupervised learning** task

- What defines a cluster?
 - Cluster defined by **center & shape/spread**
 - Assign observation (*doc*) to cluster (*topic label*)
   - Score under cluster is higher than others
   - Often, just more similar to assigned cluster center than other cluster centers

### k-means
- Assume
 - Similarity metric = **distance to cluster center** (smaller better)
- algorithm
 1. Initialize cluster centers (random choice)
 2. Assign observations to closest cluster center
   - Voronoi tessellation
 3. Revise cluster centers as mean of assigned observations
 4. Repeat 2. + 3. until convergence
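
The algorithm can be sketched in plain Python. This is a minimal illustration under several assumptions: squared Euclidean distance, initialization by sampling `k` observations, convergence defined as "no assignment changed", and an empty cluster keeping its previous center. The parameter names (`max_iters`, `seed`) are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Initialize cluster centers by picking k observations at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Assign each observation to its closest cluster center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Converged once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Revise each center as the mean of its assigned observations.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = mean(members)
    return centers, labels

# Two well-separated groups of 2-D points (made-up data).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = k_means(points, k=2)
print(labels)   # the first three points share one label, the last three the other
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the grouping is meaningful, not the label values themselves.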

### Other examples
- Clustering images
 - For search, group as: Ocean, Pink flower, Dog, ...
- Grouping patients by medical condition
 - Better characterize subpopulations and diseases
- Products on Amazon
 - Discover product categories from purchase histories
 - Recommender System
 - Or discovering groups of users
- Structuring web search results
 - Search terms can have multiple meanings
- Discovering similar neighborhoods
 - Task 1: Estimate price at a small regional level
  - Challenge: Only a few (or no!) sales in each region per month
  - Solution: Cluster regions with similar trends and share information within a cluster
 - Task 2: Forecast violent crimes to better task police
  - Again, cluster regions and share information!
  - Leads to improved predictions compared to examining each region independently

### Clustering and similarity ML block diagram
- `Training Data` ( doc id, doc text table ) -> `Feature extraction` -> $x$ ( tf-idf )
- $x$ -> `ML model(clustering)` ( $\hat{w}$: cluster centers ) -> $\hat{y}$ ( estimated cluster label )
- The true label $y$ is not available (unsupervised)
- `Quality metric` -> `ML algorithm (k-means)` -> distances( $x, \hat{w}$ ) -> $\hat{w}$
 - Loop: keep updating $\hat{w}$ to minimize the sum of distances
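
The quality metric in the loop can be written out explicitly. The following is a sketch, assuming squared Euclidean distance (the usual choice for k-means; the notes only say "sum of distances"), with $N$ observations and $\hat{w}_j$ the center of cluster $j$. The symbols $N$ and $j$ are introduced here for illustration.

$$
\hat{y}_i = \arg\min_{j} \lVert x_i - \hat{w}_j \rVert^2,
\qquad
\text{quality}(\hat{w}) = \sum_{i=1}^{N} \lVert x_i - \hat{w}_{\hat{y}_i} \rVert^2
$$

Each observation is labeled with its closest center, and the algorithm looks for centers $\hat{w}$ that make the total distance from observations to their assigned centers as small as possible.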