# To Downsample or Not
`sample` is an easy hyperparameter to set, but should you use it at all?

   `sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).`
   
The parameter approximates the effect of removing stopwords: very high-frequency words that contribute little to a context. Rather than removing them outright, it randomly drops a share of their occurrences. Here is the effect of the `sample` parameter in practice:

INFO : collected 6089246 word types from a corpus of 1566076356 raw words and 83830773 sentences

INFO : sample=1e-05 downsamples 4072 most-common words

INFO : collected 79260 word types from a corpus of 10277876 raw words and 593596 sentences

...

0.00001  # sample=1e-05 downsamples 4158 most-common words

...

sample=0.001 downsamples 32 most-common words
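
To make these log lines more concrete, here is a minimal sketch of the keep-probability rule as we understand it from the word2vec-style implementation. The helper name `keep_probability` is made up for this illustration, the example word (one that makes up 0.1% of the corpus) is hypothetical, and the library computes its total over the retained vocabulary rather than over raw words, so treat the numbers as approximate.

```python
import math

def keep_probability(count, total_words, sample):
    """Approximate chance that one occurrence of a word survives downsampling.

    Hypothetical helper: mirrors the word2vec-style rule, not a library API.
    """
    if not sample:
        return 1.0  # sample=0 disables downsampling in this sketch
    threshold_count = sample * total_words
    p = (math.sqrt(count / threshold_count) + 1) * (threshold_count / count)
    return min(p, 1.0)

# Corpus size taken from the first log line above.
total = 1566076356
# Hypothetical word that accounts for 0.1% of all raw words.
count = total // 1000

for sample in (1e-3, 1e-5):
    print(sample, round(keep_probability(count, total, sample), 3))
```

In this sketch a word only starts losing occurrences once its share of the corpus exceeds roughly 2.6 times `sample`, which is why a smaller `sample` value downsamples many more words.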
    
    
## Let's look at how using it could change your results:
We'll scan our English Wikipedia corpus and count the occurrences of each distinct word. Then we'll pull the top 200 words and see which of them would be downsampled and have occurrences dropped from vector training.

In [3]:
! wc -l wikimedia.en.processed.cor

 84231634 wikimedia.en.processed.cor


In [4]:
! wc -w wikimedia.en.processed.cor

 1566161633 wikimedia.en.processed.cor


In [2]:
from collections import Counter

counter = Counter()

# Stream the corpus line by line so the whole file never has to fit in memory.
with open('wikimedia.en.processed.cor', 'rt') as reader:
    for line in reader:
        # split() with no arguments already discards surrounding whitespace.
        counter.update(line.split())

counter.most_common(200)

### Although many linking words of dubious value are in the top 200, many others likely carry valuable information, such as:
* ('New', 1599801),
* ('United', 1312294),
* ('University', 1228325),
* ('American', 1160728),
* ('town', 662736),
* ('song', 660788),
* ('public', 651966),
* ('building', 650227)
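
As a rough check, we can ask whether the words listed above would be hit at the two `sample` values from the logs. This is a back-of-the-envelope sketch: it assumes the word2vec-style rule in which a word is downsampled once its share of the corpus exceeds about 2.618 times `sample`, and it uses the `wc -w` total as the corpus size, which only approximates what the library counts. The helper name `is_downsampled` is ours.

```python
import math

TOTAL_WORDS = 1566161633  # from the `wc -w` output above

# Counts copied from the list above.
informative = {
    'New': 1599801, 'United': 1312294, 'University': 1228325,
    'American': 1160728, 'town': 662736, 'song': 660788,
    'public': 651966, 'building': 650227,
}

def is_downsampled(count, total_words, sample):
    """True if the word's keep probability would fall below 1 (assumed rule)."""
    cutoff = sample * total_words * (3 + math.sqrt(5)) / 2
    return count > cutoff

for sample in (1e-5, 1e-3):
    hit = [w for w, c in informative.items()
           if is_downsampled(c, TOTAL_WORDS, sample)]
    print(f"sample={sample}: {len(hit)} of {len(informative)} listed words downsampled")
```

Under these assumptions every word in the list is downsampled at `sample=1e-05`, while none is at `sample=0.001`.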

# It is better to filter stopwords out while preprocessing the corpus than to set a downsample rate.
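
One way to do that is to drop stopwords while streaming the corpus, before the sentences reach the model. The sketch below is illustrative only: the `STOPWORDS` set is a tiny made-up sample (a real list would be much longer), and `filter_stopwords` is a hypothetical helper, not part of any library.

```python
# Tiny illustrative stopword set; an assumption, not a recommended list.
STOPWORDS = frozenset({'the', 'of', 'and', 'in', 'a', 'to', 'is', 'was'})

def filter_stopwords(lines, stopwords=STOPWORDS):
    """Yield each line as a token list with stopwords removed.

    Matching is case-insensitive, but surviving tokens keep their case,
    so words such as 'New' or 'United' pass through untouched.
    """
    for line in lines:
        tokens = [tok for tok in line.split() if tok.lower() not in stopwords]
        if tokens:  # skip lines that contained only stopwords
            yield tokens

sample_lines = ['The University of New York is in the United States', 'of the']
print(list(filter_stopwords(sample_lines)))
```

The same generator could wrap the open file handle for `wikimedia.en.processed.cor`, so the corpus is filtered in a single pass and the informative high-frequency words keep all of their occurrences.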