# Crawl Data Analysis: Clustering

This notebook tries several clustering techniques on our web crawl data. It was written for Python 2.7 and assumes it is running on cycles. It requires sklearn >= 0.20 and numpy >= 1.16.0. You can view and edit the notebook remotely as follows:

- Clone the GitHub repo to cycles (e.g. spin.cs.princeton.edu)
- Start the notebook server. Jupyter is not installed globally, but you can install it locally with `pip install --user jupyter`. Run the notebook inside a tmux session: `tmux`, then `cd [this directory]`, then `jupyter notebook --no-browser --port 8889` (any free port works, but the remaining steps assume 8889). Copy the URL that Jupyter prints, since this is the URL you will open in your browser. Then press Ctrl-B, D to detach from the tmux session, and log out of cycles.
- On your local machine, forward your local port 8889 to the remote port 8889 on cycles: `ssh -L 8889:localhost:8889 [netid]@spin.cs.princeton.edu`
- Now you can open the notebook in your browser by pasting the link you copied earlier.

In [90]:
from __future__ import print_function
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import os
import pandas as pd

## Read from database

Read the crawl data from the database. Here we read in the `site_visits` and `segments` tables and join them.

In [91]:
import sqlite3

db = '/n/fs/darkpatterns/final-crawl/webtap/webtap.sqlite'
con = sqlite3.connect(db)

In [92]:
site_visits = pd.read_sql_query('''SELECT * from site_visits''', con)

In [97]:
print('Number of site visits: %s' % str(site_visits.shape))
print('site_visits columns: %s' % str(list(site_visits.columns.values)))

Number of site visits: (26684, 4)
site_visits columns: ['visit_id', 'crawl_id', 'site_url', 'domain']


In [104]:
# Print some sites
pd.options.display.max_colwidth = 1000
r = np.random.choice(np.arange(site_visits.shape[0]), size=25, replace=False)
print(site_visits['site_url'][r])

with open('output/some_sites.txt', 'w') as f:
    f.write(str(site_visits['site_url'][r]))

3380                                                                                                                          https://www.macgamestore.com/showcase/Sega-Best-Of-2018/
11463                                                                                                                     https://ekster.com/products/pocket-strap?variant=36566975436
2292                                                                                https://techarmor.com/apple-iphone-7-plus-iphone-8-plus-hd-clear-film-screen-protector-3-pack.html
7995                                                                                                                      https://bemz.com/articles/models/sofa-covers/snd1/?from=5071
21275                                 https://www.wulflund.com/army-outdoor-shop/clothing-army-outdoor-police/military-patches/walhalla-ticket-3d-blackmedic-rubber-velcro-patch.html/
5463                                                                                 

Compute how many unique domains we have.

In [94]:
from urlparse import urlparse

site_visits['domain'] = site_visits['site_url'].apply(lambda x: urlparse(x).netloc)
grouped = site_visits.groupby(['domain']).count().sort_values('visit_id', ascending=False)
print('Number of unique domains: %s' % str(grouped.shape[0]))

Number of unique domains: 5732


In [None]:
segments = pd.read_sql_query('''SELECT * from segments''', con)

In [107]:
print('Number of segments: %s' % str(segments.shape))
print('segments columns: %s' % str(list(segments.columns.values)))

NameError: name 'segments' is not defined

In [None]:
# Join on visit_id, keeping every segments column: later cells use node_name, longest_text, etc.
segments_indexed = segments.reset_index().set_index('visit_id')
site_visits_subset = site_visits.reset_index()[['visit_id', 'site_url', 'domain']].set_index('visit_id')
segments = segments_indexed.join(site_visits_subset, how='inner')

In [None]:
print('Number of segments: %s' % str(segments.shape))
print('segments columns: %s' % str(list(segments.columns.values)))

## Preprocess data

Ignore `BODY` elements and segments whose `inner_text` is empty after stripping whitespace.

In [29]:
segments['inner_text'] = segments['inner_text'].str.strip()
segments = segments[(segments['node_name'] != 'BODY') & (segments['inner_text'] != '')]

Add a new column for the number of lines in each segment (the number of newlines plus one).

In [30]:
segments['newline_count'] = segments['inner_text'].apply(lambda x: len(x.split('\n')))

Apply some standard preprocessing techniques for string data (ref: [KDnuggets article](https://www.kdnuggets.com/2017/06/text-clustering-unstructured-data.html)):

- Lowercasing
- Replacing numbers with a placeholder, including numbers with an attached unit of measure (e.g. 64gb, 5kg)
- Removing punctuation
- Removing excess whitespace
- Removing known "stop words"/"stop phrases"/generic words (articles, conjunctions, etc.)
- Stemming (reducing words to their stems, here via Porter's stemming algorithm)

In [31]:
from nltk.stem.porter import PorterStemmer
import nltk
import re
import string

nltk.download('stopwords')

stemmer = PorterStemmer()
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords.union(set(string.punctuation))

def preprocess(s):
    s = s.lower()
    s = re.sub(r'\d+', 'dpnum', s)
    s = re.sub(r'[^a-z\s]', '', s)
    s = re.sub(r'\s+', ' ', s)
    words = s.split()
    words = [stemmer.stem(w) for w in words if len(w) > 0 and w not in stopwords]
    
    # Optimization to get rid of suffixes (e.g. units of measure)
    for i in range(len(words)):
        if words[i].startswith('dpnum'):
            words[i] = 'dpnum'
    
    return ' '.join(words)

[nltk_data] Downloading package stopwords to /u/mjf4/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [32]:
# Sanity check
s = '''Color   
Choose an option
Silver
Space Gray

Size    
Choose an option
64 GB'''
print('ORIGINAL STRING:')
print(s)
print()
print('PREPROCESSED STRING:')
print(preprocess(s))

ORIGINAL STRING:
Color   
Choose an option
Silver
Space Gray

Size    
Choose an option
64 GB

PREPROCESSED STRING:
color choos option silver space gray size choos option dpnum gb


In [33]:
segments['inner_text_processed'] = segments['inner_text'].apply(preprocess)
segments['longest_text_processed'] = segments['longest_text'].apply(preprocess)

Add new columns for length of original text and processed text.

In [34]:
segments['inner_text_length'] = segments['inner_text'].apply(lambda x: len(x))
segments['inner_text_processed_length'] = segments['inner_text_processed'].apply(lambda x: len(x))
segments['longest_text_length'] = segments['longest_text'].apply(lambda x: len(x))
segments['longest_text_processed_length'] = segments['longest_text_processed'].apply(lambda x: len(x))

In [35]:
new_cols = ['newline_count', 'inner_text_length', 'inner_text_processed_length',
            'longest_text_length', 'longest_text_processed_length']
for c in new_cols:
    print('segments[\'%s\'].describe():\n%s' % (c, segments[c].describe().to_string()))

segments['newline_count'].describe():
count    740532.000000
mean          3.981557
std          55.266438
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max        5002.000000
segments['inner_text_length'].describe():
count    740532.000000
mean        108.879522
std        3206.474593
min           1.000000
25%           5.000000
50%          14.000000
75%          35.000000
max      391699.000000
segments['inner_text_processed_length'].describe():
count    740532.000000
mean         88.379450
std        3052.305817
min           0.000000
25%           5.000000
50%          12.000000
75%          26.000000
max      378761.000000
segments['longest_text_length'].describe():
count    740532.00000
mean         22.73942
std          50.14175
min           0.00000
25%           2.00000
50%           9.00000
75%          23.00000
max        4145.00000
segments['longest_text_processed_length'].describe():
count    740532.000000
mean         16.566

Remove redundant segments: within each domain, keep only the last segment for each distinct processed `inner_text`.

In [36]:
segments = segments.groupby(['domain']).apply(lambda x: x.drop_duplicates(subset=['inner_text_processed'], keep='last'))
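The effect of the per-domain `drop_duplicates(..., keep='last')` can be illustrated without pandas. The following is a minimal sketch on made-up rows; the helper `dedupe_within_domain` is hypothetical and not part of the notebook. Unlike the `groupby`, it leaves the surviving rows in their original order.

```python
def dedupe_within_domain(rows):
    """Keep the last row for each (domain, processed text) pair.

    `rows` is a list of (domain, inner_text_processed) tuples. Both the
    helper and the sample data below are illustrative assumptions.
    """
    last_seen = {}
    for position, key in enumerate(rows):
        last_seen[key] = position  # later duplicates overwrite earlier ones
    return [rows[i] for i in sorted(last_seen.values())]


rows = [
    ('a.com', 'add cart'),
    ('a.com', 'free ship'),
    ('a.com', 'add cart'),  # duplicate within a.com: the earlier copy is dropped
    ('b.com', 'add cart'),  # same text on another domain is kept
]
deduped = dedupe_within_domain(rows)
print(deduped)
```

Identical text on different domains survives, which matters here because boilerplate such as "add to cart" is exactly what we want to see clustered across sites.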

In [37]:
print('Number of segments: %s' % str(segments.shape))
print('segments columns: %s' % str(list(segments.columns.values)))

Number of segments: (72072, 31)
segments columns: ['index', 'id', 'crawl_id', 'node_name', 'node_id', 'top', 'left', 'width', 'height', 'style', 'inner_text', 'outer_html', 'longest_text', 'longest_text_width', 'longest_text_height', 'longest_text_top', 'longest_text_left', 'longest_text_style', 'num_buttons', 'num_imgs', 'num_anchors', 'time_stamp', 'site_url', 'domain', 'newline_count', 'inner_text_processed', 'longest_text_processed', 'inner_text_length', 'inner_text_processed_length', 'longest_text_length', 'longest_text_processed_length']


Select a random subset of the data to make clustering more tractable during trial and error.

In [None]:
segments_orig = segments.copy(deep=False)
indices = np.random.choice(np.arange(segments.shape[0]), replace=False, size=25000)
segments = segments.iloc[indices,:]

In [None]:
print('segments shape: %s' % str(segments.shape))

## Create feature vectors

First we define a function to tokenize text as we convert text into feature vectors.

In [38]:
from nltk.stem.porter import PorterStemmer
import nltk

nltk.download('stopwords')

stemmer = PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')

def tokenize(line):
    if (line is None):
        line = ''
    tokens = [stemmer.stem(t) for t in nltk.word_tokenize(line) if len(t) != 0 and t not in stopwords and not t.isdigit()]
    return tokens

[nltk_data] Downloading package stopwords to /u/mjf4/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now select one of the following cells to run to create a feature representation. Either load from the pre-existing file or recompute the features.

### 1. Bag of words

In [None]:
if os.path.isfile('output/features_bow.npy'):
    features = np.load('output/features_bow.npy')
    print('Loaded from file')
else:
    print('No pre-existing file')

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# binary_rep: False keeps raw term counts; True gives a binary (presence/absence) representation.
binary_rep = False
data = segments['inner_text_processed']
vec = CountVectorizer(tokenizer=tokenize, binary=binary_rep, strip_accents='ascii').fit(data)

In [None]:
print('Length of vocabulary %s' % str(len(vec.vocabulary_)))

In [None]:
vec = vec.transform(data)
features = normalize(vec, axis=0)
np.save('output/features_bow.npy', features)

### 2. TFIDF

Optionally load the features from a file: if the file is present, run the loading cell and skip the cells that compute the TFIDF features.

In [10]:
if os.path.isfile('output/features_tfidf.npy'):
    features = np.load('output/features_tfidf.npy')
    print('Loaded from file')
else:
    print('No pre-existing file')

Loaded from file


In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
data = segments['inner_text_processed']
vec = TfidfVectorizer(binary=False, strip_accents='ascii', max_features=1000).fit(data)
features = vec.transform(data)
np.save('output/features_tfidf.npy', features)

[nltk_data] Downloading package punkt to /u/mjf4/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [40]:
print('features shape (num_examples, num_features): %s' % str(features.shape))
print('Length of vocabulary: %s' % str(len(vec.vocabulary_)))

features shape (num_examples, num_features): (72072, 1000)
Length of vocabulary: 1000


In [41]:
sorted_vocab = sorted(vec.vocabulary_.items(), key=lambda tf: tf[1])
print('Vocabulary')
print('\n'.join(['%d: %s' % (f, t) for t, f in sorted_vocab]))

Vocabulary
0: abiet
1: abl
2: ac
3: acacia
4: accept
5: access
6: accessori
7: account
8: ace
9: acenaphthi
10: aci
11: acid
12: acryl
13: acrylamid
14: activ
15: ad
16: adapt
17: add
18: addit
19: address
20: adjust
21: admi
22: adpnum
23: advanc
24: affili
25: age
26: ago
27: agre
28: air
29: al
30: ale
31: allow
32: also
33: aluminum
34: alway
35: amaz
36: amazon
37: america
38: american
39: ammonium
40: amount
41: anoth
42: answer
43: app
44: appl
45: appli
46: applic
47: area
48: arm
49: around
50: arriv
51: art
52: artist
53: ask
54: assist
55: attach
56: auction
57: audio
58: automat
59: avail
60: away
61: azi
62: babi
63: back
64: backord
65: bag
66: balanc
67: ball
68: band
69: bank
70: bar
71: barium
72: barrel
73: base
74: basic
75: basket
76: batteri
77: bear
78: beauti
79: beck
80: bed
81: belt
82: ben
83: bentonit
84: benzenesul
85: benzyl
86: best
87: better
88: bi
89: big
90: bike
91: bill
92: bisphenol
93: bit
94: black
95: blanc
96: blend
97: block
98: blog
99: blou
1

### 3. Word Vectors

We compute a vector for each segment as follows: compute the word vector for each word in the segment's `inner_text`, and then average over all words in that segment.

While it's simple, there are clearly downsides to this approach:

- We lose information about word ordering
- All words are equally weighted, so words that really characterize the text are not prioritized
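The averaging step, and the loss of word order it causes, can be seen in a small sketch. The 2-dimensional embeddings and the helper `average_vector` below are invented for illustration; the notebook itself takes its vectors from spaCy's `doc.vector`.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # out-of-vocabulary tokens are skipped
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [t / count for t in total]


# Toy embeddings (assumption: made up, not from any real model).
toy_embeddings = {'free': [1.0, 0.0], 'ship': [0.0, 1.0], 'today': [1.0, 1.0]}

seg_a = average_vector(['free', 'ship', 'today'], toy_embeddings, 2)
seg_b = average_vector(['today', 'ship', 'free'], toy_embeddings, 2)
print(seg_a, seg_b)
```

Both segments map to the same vector even though the words appear in a different order, and each word contributes equally to the average.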

In [None]:
if os.path.isfile('output/features_wordvec.npy'):
    features = np.load('output/features_wordvec.npy')
    print('Loaded from file')
else:
    print('No pre-existing file')
    import en_core_web_sm

    data = segments['inner_text_processed']
    nlp = en_core_web_sm.load()
    vecs = []
    for doc in nlp.pipe(data.str.replace(r'\d+', '').astype('unicode').values, batch_size=10000, n_threads=7):
        if doc.is_parsed:
            vecs.append(doc.vector)
        else:
            vecs.append(None)
    features = np.array(vecs)
    np.save('output/features_wordvec.npy', features)
    
print('features shape: %s' % str(features.shape))

### PCA
Try using PCA to reduce the dimension of the data.

The feature matrix is expected to be provided with examples in rows (`num_examples` x `num_features`).

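Projected data is given by $U^T X$, where $U$ is the matrix with PCs in columns (`orig_dim` x `reduced_dim`), and $X$ is the data matrix with examples in columns (`orig_dim` x `num_examples`). In our case, $U^T$ is `pca.components_` and $X$ is `features.T`. Note that this projection does not subtract the per-feature mean, so it differs from `pca.transform(features)` by a constant offset.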
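As a rough sketch of the projection $U^T X$, the snippet below derives principal components from an SVD of centered random data (an assumption made to keep the example free of sklearn) and compares the uncentered projection used in this notebook with the centered one. sklearn's `PCA.transform` subtracts the fitted mean before projecting, so the two differ by the same offset in every row.

```python
import numpy as np

rng = np.random.RandomState(0)
X_rows = rng.rand(50, 4)  # examples in rows: (num_examples, num_features)
mean = X_rows.mean(axis=0)

# Rows of Vt are the principal directions, playing the role of pca.components_ (U^T).
_, _, Vt = np.linalg.svd(X_rows - mean, full_matrices=False)
Ut = Vt[:2]  # keep 2 components: (reduced_dim, orig_dim)

proj_uncentered = np.dot(Ut, X_rows.T).T     # U^T X, then back to examples in rows
proj_centered = np.dot(X_rows - mean, Ut.T)  # projection of the centered data

offset = proj_uncentered - proj_centered     # identical in every row
print(proj_uncentered.shape, offset[0])
```

Because the offset is constant, distance-based clustering on either projection gives the same pairwise Euclidean distances.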

In [None]:
if os.path.isfile('output/features_proj.npy'):
    features = np.load('output/features_proj.npy')
    print('Loaded from file')
else:
    print('No pre-existing file')
    from sklearn.decomposition import PCA
    
    pca = PCA(n_components=5)
    # pca = PCA(tol=10)
    pca.fit(features)
    
    print('Matrix of PCs: %s' % str(pca.components_.shape))
    print('Data matrix: %s' % str(features.shape))
    print('%d singular values: %s' % (pca.singular_values_.shape[0], str(pca.singular_values_)))
    
    features = np.dot(pca.components_, features.T)
    features = features.T
    
    np.save('output/features_proj.npy', features)
    
print('feature matrix shape (after PCA): %s' % str(features.shape))

Plot the data along the first three dimensions of the reduced space (this is only meaningful if the reduced dimension is at least 3).

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(features[:,0], features[:,1], features[:,2])

## Clustering

Run one of the following clustering algorithms.

### 1. Hierarchical clustering

In [None]:
from scipy.spatial import distance
import fastcluster

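# featdense = features.todense()
# Keep the condensed distance vector: linkage() treats a 2-D array as observation
# vectors, so passing the square form would cluster rows of the distance matrix.
distances = distance.pdist(features, metric='cosine')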

In [None]:
clusters = fastcluster.linkage(distances, method='ward', preserve_input=False)
np.save('output/hierarchical_linkage_matrix.npy', clusters)

Plot a dendrogram of the resulting clusters.

In [None]:
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    clusters,
    leaf_rotation=90.,
    leaf_font_size=8.,
)
plt.show()

### 2. DBSCAN clustering

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.externals.joblib.parallel import parallel_backend

with parallel_backend('threading'):
    clusterer = DBSCAN(eps=0.0001, min_samples=3, n_jobs=10, metric='euclidean')
    cluster_labels = clusterer.fit(features)

In [None]:
segments['cluster_dbscan'] = pd.Series(cluster_labels.labels_).values

In [None]:
print('Number of clusters: %d' % len(set(cluster_labels.labels_)))
print('segments[\'cluster_dbscan\'].value_counts(): \n %s' % segments['cluster_dbscan'].value_counts().to_string())

### 3. HDBSCAN clustering

In [50]:
from sklearn.preprocessing import normalize
import hdbscan

features = normalize(features, axis=1) # Normalize each segment since using euclidean distance metric
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='euclidean')
cluster_labels = clusterer.fit_predict(features)
np.save('output/hdbscan_cluster_labels.npy', cluster_labels)
segments['cluster_hdbscan'] = pd.Series(cluster_labels).values
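The reason for normalizing each row before using the Euclidean metric is that, for unit-length vectors, squared Euclidean distance equals $2(1 - \cos\theta)$, so Euclidean neighbors are also cosine neighbors. A minimal check on two made-up vectors (the helper names are illustrative):

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u, v = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
un, vn = l2_normalize(u), l2_normalize(v)

sq_euclidean = sum((a - b) ** 2 for a, b in zip(un, vn))
cos_sim = cosine_similarity(u, v)
print(sq_euclidean, 2 * (1 - cos_sim))
```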

In [51]:
print('segments[\'cluster_hdbscan\'].value_counts(): \n %s' % segments['cluster_hdbscan'].value_counts().to_string())

segments['cluster_hdbscan'].value_counts(): 
 -1       26227
 56       9922
 1561     3131
 25        737
 42        734
 32        704
 16        335
 1239      274
 1484      218
 34        183
 47        182
 40        182
 663       160
 1195      159
 977       158
 1012      156
 18        147
 1483      143
 1499      143
 1496      140
 1549      133
 1518      132
 50        126
 1211      125
 631       123
 1409      122
 1231      119
 1451      118
 1421      114
 1509      113
 1177      113
 730       109
 1490      105
 1232      104
 1506      104
 1432      101
 1530      101
 844       100
 22         94
 361        93
 48         88
 1053       86
 1201       85
 1242       85
 640        83
 987        83
 1256       79
 618        78
 658        78
 1157       78
 693        77
 1223       77
 1457       76
 1379       75
 764        75
 676        75
 1482       74
 1050       74
 1088       73
 959        72
 1309       71
 46         71
 1548       70
 1474    

### 4. Expectation Maximization (EM)

Reference/inspiration found [here](https://suif.stanford.edu/~livshits/papers/pdf/uwpc.pdf) and [here](https://www.datascience.com/blog/k-means-alternatives).

The form of EM we use here can be thought of as a generalization of K-means clustering that uses "soft" clusters instead of "hard" ones. At each iteration, rather than assigning every data point to a single cluster as K-means does, EM computes the probability that the point belongs to each cluster, giving a list of probabilities for each data point. The model parameters are then updated to maximize the likelihood of the data under these soft assignments.
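The "soft" assignment can be sketched for a one-dimensional mixture of two Gaussians. The parameters below are invented for illustration, and the function `responsibilities` is a hypothetical helper showing only the E-step; `GaussianMixture` handles the full multivariate case and the parameter updates.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """E-step for one point: posterior probability of each component."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Two equally weighted unit-variance components (assumed values).
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

soft_mid = responsibilities(2.0, weights, means, stds)   # halfway between the means
soft_near = responsibilities(0.5, weights, means, stds)  # close to the first mean
hard_near = soft_near.index(max(soft_near))              # what a K-means-style hard label would be
print(soft_mid, soft_near, hard_near)
```

A point halfway between the two means is split evenly between the clusters, whereas a hard assignment would have to pick one of them arbitrarily.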

In [None]:
from sklearn.mixture import GaussianMixture

em = GaussianMixture(n_components=100, max_iter=15, tol=0.05, random_state=13, verbose=2, verbose_interval=1)
em.fit(features.todense())

In [None]:
cluster_labels = em.predict(features.todense())

In [None]:
segments['cluster_em'] = pd.Series(cluster_labels).values

In [None]:
np.save('output/em_cluster_labels.npy', cluster_labels)

In [None]:
print('Number of clusters: %d' % len(set(cluster_labels)))
print('segments[\'cluster_em\'].value_counts(): \n %s' % segments['cluster_em'].value_counts().to_string())

## Evaluate Performance

Below we compute a few performance metrics on the clusters to get a sense of how well each method did.

In [53]:
segments['cluster'] = segments['cluster_hdbscan']

### 1. Silhouette Coefficient

Ref: https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient

Computes for each sample:

$$\frac{b - a}{\max(a, b)}$$

where $a$ is the mean distance between this point and other points in its assigned cluster, and $b$ is the mean distance between this point and other points in the next nearest cluster (based on a distance metric, which is a parameter, such as Euclidean distance).

**TL;DR** Ranges between -1 (poorly assigned labels) and +1 (highly dense and separated clusters). In particular, a score near 0 indicates overlapping clusters.
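To make the formula concrete, here is a small pure-Python sketch that computes per-sample silhouette values on four made-up 2-D points, once with sensible labels and once with deliberately bad ones. The helper `silhouette_sample` is illustrative and assumes Euclidean distance; the notebook itself uses `metrics.silhouette_score`.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_sample(i, points, labels):
    """Silhouette value (b - a) / max(a, b) for the point at index i."""
    own = labels[i]
    by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster.setdefault(lab, []).append(dist(points[i], p))
    if own not in by_cluster:
        return 0.0  # assumed convention for a cluster with a single point
    a = sum(by_cluster[own]) / len(by_cluster[own])
    b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
    return (b - a) / max(a, b)


def mean_silhouette(points, labels):
    scores = [silhouette_sample(i, points, labels) for i in range(len(points))]
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_score = mean_silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad_score = mean_silhouette(points, [0, 1, 0, 1])   # each cluster spans both groups
print(good_score, bad_score)
```

The well-separated labeling scores close to +1, while the scrambled labeling scores below 0.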

In [None]:
from sklearn import metrics

s = metrics.silhouette_score(features, segments['cluster'], metric='euclidean')

In [None]:
print('Silhouette score: %.3f' % s)

### 2. Calinski-Harabaz Index

Ref: https://scikit-learn.org/stable/modules/clustering.html#calinski-harabaz-index
    
Computes the ratio of between-cluster dispersion to within-cluster dispersion. "Between-cluster dispersion" is computed by taking the weighted sum of "scatter matrices" between cluster centroids and the centroid of the entire dataset:
$$\sum_q n_q (c_q - c)(c_q - c)^T$$
where $n_q$ is the number of points in cluster $q$, $c_q$ is the centroid of cluster $q$, and $c$ is the center of all the data; and then taking the trace.

"Within-cluster dispersion" for a single cluster is an analogous measure using the scatter matrix over that cluster:
$$\sum_{x \in \text{cluster } q} (x - c_q)(x - c_q)^T$$
where $c_q$ is the centroid of cluster $q$. Total within-cluster dispersion is computed by summing these scatter matrices over all clusters and taking the trace.

**TL;DR** Larger values are better and indicate denser, more separated clusters. Values near 0 indicate overlapping clusters. This is also the fastest metric to compute, since it's very fast to generate scatter matrices (as opposed to the other metrics, which must compute many pairwise distances).
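Since the trace of $(x - c)(x - c)^T$ is just the squared distance $\|x - c\|^2$, the index can be sketched without building any matrices. The helper `ch_index` below is illustrative; it includes the $(n - k)/(k - 1)$ normalization that sklearn applies to the ratio, where $n$ is the number of points and $k$ the number of clusters.

```python
def centroid(points):
    n = float(len(points))
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ch_index(points, labels):
    """Ratio of between- to within-cluster dispersion, normalized by (n - k) / (k - 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    c = centroid(points)
    # Trace of the between-cluster scatter matrix.
    between = sum(len(pts) * sq_dist(centroid(pts), c) for pts in clusters.values())
    # Trace of the summed within-cluster scatter matrices.
    within = sum(sq_dist(p, centroid(pts)) for pts in clusters.values() for p in pts)
    return (between / (k - 1)) / (within / (n - k))


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
ch_good = ch_index(points, [0, 0, 1, 1])
ch_bad = ch_index(points, [0, 1, 0, 1])
print(ch_good, ch_bad)
```

On these made-up points the well-separated labeling gives a large value and the scrambled labeling a value near 0.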

In [54]:
from sklearn import metrics
from scipy.sparse import issparse

if issparse(features):
    chi = metrics.calinski_harabaz_score(features.todense(), segments['cluster'])
else:
    chi = metrics.calinski_harabaz_score(features, segments['cluster'])

In [55]:
print('Score: %.3f' % chi)

Score: 40.797


### 3. Davies-Bouldin Index

Ref: https://scikit-learn.org/stable/modules/clustering.html#davies-bouldin-index

Computes the average "similarity" between each cluster and its most similar one. Similarity between clusters $i$ and $j$ is $R_{ij}$ and is computed as:
$$R_{ij} = \frac{s_i + s_j}{d_{ij}}$$
where $s_i$ is the average distance from $i$'s centroid to its points ("cluster diameter"), and $d_{ij}$ is the distance between cluster centroids. Then the DB index is the average over all $k$ clusters:
$$\frac{1}{k}\sum_{i} \max_{j \neq i} R_{ij}$$

**TL;DR** Similarity values closer to 0 indicate better separation of clusters. Thus DB values closer to 0 are better, and larger values indicate sparse and/or overlapping clusters.
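The definition translates almost directly into code. The following pure-Python sketch (helper names are illustrative, Euclidean distance assumed) evaluates the index on four made-up points with a good and a bad labeling; the notebook itself uses `davies_bouldin_score`.

```python
import math


def centroid(points):
    n = float(len(points))
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def db_index(points, labels):
    """Average over clusters of the worst-case similarity R_ij."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cents = dict((lab, centroid(pts)) for lab, pts in clusters.items())
    # s_i: mean distance from each cluster's centroid to its points.
    spread = dict((lab, sum(dist(p, cents[lab]) for p in pts) / len(pts))
                  for lab, pts in clusters.items())
    total = 0.0
    for i in clusters:
        total += max((spread[i] + spread[j]) / dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
db_good = db_index(points, [0, 0, 1, 1])
db_bad = db_index(points, [0, 1, 0, 1])
print(db_good, db_bad)
```

The compact, well-separated labeling gives a value close to 0, and the scrambled labeling gives a much larger one.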

In [None]:
from sklearn.metrics import davies_bouldin_score

dbi = davies_bouldin_score(features, segments['cluster'])

In [None]:
print('Score: %.3f' % dbi)

## Visualize results

Produce a CSV file that shows the segments in each cluster.

In [56]:
inner_texts = segments['inner_text']
cluster_labels = segments['cluster']
urls = segments['site_url']
print("segments['inner_text'] is %s, segments['cluster'] is %s, segments['site_url'] is %s (should be the same)" % (str(inner_texts.shape), str(cluster_labels.shape), str(urls.shape)))

segments['inner_text'] is (72072,), segments['cluster'] is (72072,), segments['site_url'] is (72072,) (should be the same)


Group the segments by cluster.

In [57]:
from collections import defaultdict

inner_text_by_cluster = defaultdict(list)
url_by_cluster = defaultdict(list)
# Iterate over the underlying values so the grouping does not depend on the DataFrame index
for label, text, url in zip(cluster_labels.values, inner_texts.values, urls.values):
    inner_text_by_cluster[str(label)].append(text)
    url_by_cluster[str(label)].append(url)

Write CSV file.

In [None]:
import unicodecsv as csv
from datetime import datetime

timestamp = '_'.join(str(datetime.now()).split(' '))
outfile = 'output/clusters-%s.csv' % timestamp
with open(outfile, 'wb') as f:
    writer = csv.writer(f)
    for cluster in inner_text_by_cluster.keys():
        segments_str = '\n\n'.join(['%s:\n%s' % (u, t) for u, t in zip(url_by_cluster[cluster], inner_text_by_cluster[cluster])])
        writer.writerow([cluster, segments_str, url_by_cluster[cluster]])

Print some clusters.

In [58]:
# Sort clusters from largest to smallest
sorted_clusters = sorted(inner_text_by_cluster.items(), key=lambda ct: len(ct[1]), reverse=True)

In [64]:
c, texts = sorted_clusters[9]
urls = url_by_cluster[c]
print('Cluster %s (length %d)' % (c, len(texts)))
print('----------')
print('\n\n'.join(['%s:\n%s' % (u, t) for u, t in zip(urls, texts)]))
print()

Cluster 34 (length 183)
----------
http://www.scannermaster.com/Uniden_Bearcat_SDS100_Police_Scanner_p/10-501979.htm:
Manuals and printed materials
USB AC Power Adapter
5400 mAH Lithium Battery (New larger battery now shipping, replaces the older 3600 mAH battery)
SMA Rubber Duck Antenna
USB Mini Cable (for programming and powering)
Sweil Belt Clip
SMA-BNC Adapter
Hand Strap

http://www.scannermaster.com/Uniden_Bearcat_SDS100_Police_Scanner_p/10-501979.htm:
Customizable Color Display
Trunktracker X
APCO P25 Phase I and II
Motorola, EDACS, and LTR Trunking
MotoTRBO Capacity + and Connect +**
DMR Tier III**
Hytera XPT**
Single-Channel DMR**
NXDN 4800 and 9600**
EDACS ProVoice**
Location-Based Scanning
USA/Canada Radio Database
ZIP Code Selection for Easy Setup
Close Call™ RF Capture with Do Not Disturb
8 GB microSD
Soft Keys for Intellegent UI
Recording, Playback, and Replay
Temporary Avoid
Fire Tone-Out Alert
System Analysis and Discovery
CTCSS/DCS/NAC/RAN/Color Code Decoding
S.A.M.E. W