# Downloading files from Olympus Mount

Most of our current collections are stored in `/scratch/olympus/` as bzipped (.bz2) JSON files.<br>
To make a collection useful, we need to uncompress the files, and set aside any corrupt records.

This is something we don't want to do where files are stored.<br>
So, we're going to copy the collection into our personal scratch space -- `/scratch/$USER`,<br>
which has 5 TB of storage, and can be messed up however you want!
<hr>
### A few words before we start
This is a Jupyter notebook. It contains Markdown with HTML (like this cell), as well as Python code.<br>
I highly recommend making a copy of the .ipynb file, and renaming it!<br>
To follow along, you can run each cell using (shift + enter).

To use this Jupyter notebook, you should be on a CPU node on NYU HPC's [Prince Cluster](https://wikis.nyu.edu/display/NYUHPC/Getting+started+on+Prince).<br>
Use the following [sbatch](https://wikis.nyu.edu/display/NYUHPC/Slurm+Tutorial) [script](https://github.com/SMAPPNYU/smapputil/blob/master/sbatch/cpu-jupyter.sbatch) to connect to a CPU node on Prince.

Scratch is a temporary space. Make sure any transformations and analysis outside this notebook are reproducible.

Author @yinleon

<hr>
The following cell contains the packages we're going to use. <br>
Importing packages is typically the first code cell of a notebook.

The first 7 packages are part of Python's standard library; [smappdragon](https://github.com/SMAPPNYU/smappdragon) is not. <br>
Third party packages power the Python ecosystem but need to be [explicitly downloaded](https://github.com/SMAPPNYU/smappdragon#installation).

In [1]:
import os
import glob
from itertools import repeat
from shutil import copyfile
from subprocess import Popen, PIPE
from multiprocessing import Pool
from time import sleep

from smappdragon.tools.tweet_cleaner import clean_tweets

What collections exist on scratch?<br>
Jupyter Notebooks can execute [magic commands](https://ipython.org/ipython-doc/3/interactive/magics.html), using the `%` symbol.

In [2]:
%ls /scratch/olympus | head -15

47_traitors/
arab_events_2/
arab_events_3_2016/
argentina_politicians/
attack_on_mcfaul/
brexit_2015/
brexit_2016/
britain_broadcast_journalists/
britain_broadcast_journalists_2016/
britain_election_2015/
britain_election_2016/
britain_isis/
britain_national_journalists/
britain_national_journalists_2016/
britain_parties/
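
If you would rather stay in plain Python than use a magic command, a similar listing can be sketched with `os.scandir()`. The helper name `list_collections` is made up for this example, and the demo runs on a temporary directory rather than `/scratch/olympus`, so the names below are stand-ins.

```python
import os
import tempfile

def list_collections(root, limit=15):
    '''
    Return the first `limit` sub-directory names in root, sorted by name.
    (list_collections is a hypothetical helper, not part of any package.)
    '''
    names = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    return names[:limit]

# Demo on a throwaway directory, so this runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for name in ['brexit_2016', 'brexit_2015', 'arab_events_2']:
        os.makedirs(os.path.join(root, name))
    demo_listing = list_collections(root, limit=2)
    print(demo_listing)
```

On Prince, you would call `list_collections('/scratch/olympus')` instead.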


From the above, which collection do you want in your `/scratch/$USER` space?

In [3]:
# change this! Afterwards click: Cell -> Run All... if ydgaf!
collection_name = 'brexit_2016'

After packages, global variables are declared...<br>
Below are the variables we'll be using, which will be generated dynamically based on `collection_name`, <br>
and the `$USER` environment variable stored on your Prince account.

In [4]:
netid = os.environ.get('USER')
olympus_tweets = '/scratch/olympus/{}/data'.format(collection_name)
olympus_local = '/scratch/{}/olympus_local/'.format(netid)
collection_local = olympus_local + collection_name

if not os.path.exists(olympus_tweets):
    print("the collection: {} does not exist".format(olympus_tweets))

In [5]:
print("files will be stored here: {}".format(collection_local))

files will be stored here: /scratch/ly501/olympus_local/brexit_2016


Let's make the directories where data will go...

In [6]:
if not os.path.exists(olympus_local):
    os.makedirs(olympus_local)
if not os.path.exists(collection_local):
    os.makedirs(collection_local)
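
As a side note, `os.makedirs()` already creates missing parent directories, and its `exist_ok=True` argument (Python 3.2+) skips the error when the directory is already there. A minimal sketch, using a temporary directory as a stand-in for `/scratch/$USER` (the `demo_` names are only for this example):

```python
import os
import tempfile

# A throwaway stand-in for /scratch/$USER, so this runs anywhere.
scratch_root = tempfile.mkdtemp()
demo_collection_local = os.path.join(scratch_root, 'olympus_local', 'brexit_2016')

# exist_ok=True creates every missing parent directory,
# and does nothing if the directory is already there.
os.makedirs(demo_collection_local, exist_ok=True)
os.makedirs(demo_collection_local, exist_ok=True)  # safe to run again

print(os.path.isdir(demo_collection_local))
```

This also avoids a race when several processes try to create the same directory at once.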

## Functions
Because each file from Olympus will undergo the same operations, we can make our workflow modular.<br>
In Python, functions are declared using the `def some_function(arg_1, arg_2)` format.

In [7]:
def add_1(number):
    '''
    An example of a function that adds 1 to the input (number)
    '''
    return number + 1

functions are called as follows:

In [8]:
add_1(2)

3

### Copy files into your workspace
We'll be copying files into `collection_local` (`/scratch/$USER/olympus_local/<collection_name>`)<br>
using the `copyfile()` function from the [shutil](https://docs.python.org/3/library/shutil.html#shutil.copyfile) module.<br>
We can read the docs for a function using a well-placed question mark, and running the cell.

In [9]:
copyfile?

We see that this function takes an input (src) and an output (dst) -- where the file is to be copied from, and where it is to be copied to.
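
To see `copyfile()` in action without touching Olympus, here is a small sketch that copies a one-line file inside a temporary directory. The file names are made up for the example.

```python
import os
import tempfile
from shutil import copyfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'tweets.json')        # where we copy from
    dst = os.path.join(tmp, 'tweets_copy.json')   # where we copy to

    with open(src, 'w') as fh:
        fh.write('{"id": 1}\n')

    # copyfile copies the contents of src into dst, and returns dst.
    returned = copyfile(src, dst)

    with open(dst) as fh:
        copied_text = fh.read()

print(copied_text)
```

Note that `dst` has to be a full file path, not just a directory.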

### Unzip the archive
To uncompress a file, we can use the command-line tool `bunzip2`.<br>
We can call command-line tools in Python using the [Popen()](https://docs.python.org/3/library/subprocess.html#popen-constructor) class from the `subprocess` module.

This might take a while...
but there are ways to be clever about this, as you'll soon see.

Save this function if you want to reuse it.

In [10]:
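def bunzip(f):
    '''
    Unzip a file!
    Uses the Process open (Popen) to access the commandline.
    Waits for bunzip2 to finish, and complains if it fails.
    '''
    process = Popen(['/usr/bin/bunzip2', f], 
                    stdout=PIPE, stderr=PIPE)
    # communicate() blocks until bunzip2 exits, and drains both pipes.
    _, err = process.communicate()
    if process.returncode != 0:
        raise RuntimeError('bunzip2 failed on {}: {}'.format(f, err.decode()))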
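
If `bunzip2` is not installed at `/usr/bin/bunzip2`, Python's built-in `bz2` module can do the same job. This is an alternative sketch, not what the notebook uses; `bunzip_py` is a made-up name, and it assumes the file name ends in `.bz2`.

```python
import bz2
import os
import shutil
import tempfile

def bunzip_py(f):
    '''
    Decompress f next to itself, then delete f,
    mirroring what the bunzip2 command does by default.
    (bunzip_py is a hypothetical name used for this sketch.)
    '''
    if not f.endswith('.bz2'):
        raise ValueError('expected a .bz2 file, got: {}'.format(f))
    out = f[:-len('.bz2')]
    with bz2.open(f, 'rb') as src, open(out, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # streams the data in chunks
    os.remove(f)
    return out

# Demo: compress one line, then uncompress it again.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tweets.json.bz2')
    with bz2.open(path, 'wb') as fh:
        fh.write(b'{"id": 1}\n')
    out_path = bunzip_py(path)
    with open(out_path, 'rb') as fh:
        restored = fh.read()
    archive_gone = not os.path.exists(path)
```

Because this version only returns once the file is fully written, the caller does not have to guess when decompression is done.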

### Clean the tweets
Sometimes tweet files contain corrupt patches.<br>
This function uses a tool from [smappdragon](https://github.com/SMAPPNYU/smappdragon/blob/master/smappdragon/tools/tweet_cleaner.py#L7-L18) to isolate bad json blobs.

In [11]:
def clean_file(f):
    '''
    Cleans a tweet file.
    JSON blobs that are corrupt will be written to a new file (dirty).
    Functioning JSON blobs are written to (clean), and ultimately replace
    the OG file.
    
    If dirty ends up empty (no corrupt tweets), it is deleted.
    '''
    clean = f + '.clean_temp'
    dirty = f + '.dirty'

    clean_tweets(f, clean, dirty)
    
    os.remove(f)
    os.rename(clean, f)
    
    if os.stat(dirty).st_size == 0:
        os.remove(dirty)
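
To get a feel for what "isolating bad JSON blobs" means, here is a rough sketch of the idea. It is not smappdragon's implementation: `split_json_lines` is a made-up function, and it assumes one JSON object per line.

```python
import json
import os
import tempfile

def split_json_lines(f_in, f_clean, f_dirty):
    '''
    Write lines of f_in that parse as JSON to f_clean, and the rest to f_dirty.
    Returns (n_clean, n_dirty).
    '''
    n_clean = n_dirty = 0
    with open(f_in) as src, open(f_clean, 'w') as clean, open(f_dirty, 'w') as dirty:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            try:
                json.loads(line)
            except ValueError:  # json.JSONDecodeError is a ValueError
                dirty.write(line)
                n_dirty += 1
            else:
                clean.write(line)
                n_clean += 1
    return n_clean, n_dirty

# Demo: two good records, and one that was cut off mid-write.
with tempfile.TemporaryDirectory() as tmp:
    f = os.path.join(tmp, 'tweets.json')
    with open(f, 'w') as fh:
        fh.write('{"id": 1}\n{"id": 2, "text": "trunc\n{"id": 3}\n')
    counts = split_json_lines(f, f + '.clean_temp', f + '.dirty')
    with open(f + '.dirty') as fh:
        dirty_text = fh.read()

print(counts)
```

The naming of the output files follows `clean_file()` above, so the dirty records end up in `<file>.dirty`.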

### Let's bring it together
`bootstrap()` is used to create the file paths for files in `/scratch/$USER`, and <br>
`copy_unzip_clean()` brings the above three functions together sequentially.

In [12]:
def bootstrap(f, collection_name):
    '''
    Get the local file path for compressed and uncompressed tweet.
    '''
    f_name = f.split('/')[-1]
    f_compressed = os.path.join(collection_name, f_name)
    f_uncompressed = f_compressed.replace('.bz2', '')
    
    return f_compressed, f_uncompressed
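
Here is what `bootstrap()` gives back for a single file. The function is repeated so the example runs on its own; the file name `example.json.bz2` is made up. Note that the second argument is the local destination directory (`collection_local`), even though the parameter is called `collection_name`.

```python
import os

def bootstrap(f, collection_name):
    '''
    Get the local file path for compressed and uncompressed tweet.
    '''
    f_name = f.split('/')[-1]
    f_compressed = os.path.join(collection_name, f_name)
    f_uncompressed = f_compressed.replace('.bz2', '')

    return f_compressed, f_uncompressed

# Hypothetical file name, for illustration only.
f_c, f_u = bootstrap('/scratch/olympus/brexit_2016/data/example.json.bz2',
                     '/scratch/ly501/olympus_local/brexit_2016')
print(f_c)  # the local copy, still compressed
print(f_u)  # the same path without the .bz2 ending
```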

In [13]:
def copy_unzip_clean(f, collection_name):
    '''
    Bring together the four functions above.
    '''
    f_c, f_u = bootstrap(f, collection_name)

    if os.path.isfile(f_u):
        # file already processed, so exit.
        return
    
    copyfile(f, f_c)
    bunzip(f_c)
    while os.path.isfile(f_c):
        sleep(1)
    clean_file(f_u)

Let's use these functions!

## Implementation
Here we will get all the bzipped files in the collection as input arguments for `copy_unzip_clean()`.<br>
For background, `glob.glob(some_pattern_input)` matches file paths against a Unix shell-style wildcard pattern (not a [RegEx](https://en.wikipedia.org/wiki/Regular_expression)), and returns a list of the file paths that match.<br>
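
A minimal sketch of how the `*.bz2` pattern picks out files, run on a temporary directory with made-up file names:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty files; only two end in .bz2.
    for name in ['a.json.bz2', 'b.json.bz2', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    # '*' stands for any run of characters within a file name.
    pattern = os.path.join(tmp, '*.bz2')

    # glob.glob returns paths in arbitrary order, so we sort them.
    matches = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matches)
```

Since the order is not guaranteed, sorting the result is a good habit before slicing it.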

`Pool(N)` allows us to run the same procedure for **N** files on **N** CPUs.<br>
In this particular implementation, we run the `copy_unzip_clean()` on 4 files at the same time!

In [14]:
files = glob.glob(os.path.join(olympus_tweets, '*.bz2'))[4:8]
args = zip(files, repeat(collection_local))

In [15]:
with Pool(4) as pool:
    '''
    This is the parallelized version of the following:
    
    for f in files:
        copy_unzip_clean(f, collection_local)
    
    Pool(4), means we're using 4 cpus!
    
    pool.starmap() allows us to apply a function to all 4 CPUs in the Pool 
    using all the inputs from args.
    
    starmap (as opposed to map) allows functions that take two or more inputs, 
    in this case, the olympus file name, and the destination directory.
    '''
    pool.starmap(copy_unzip_clean, args)
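
To see which arguments each call receives, here is a serial sketch using `itertools.starmap`, which unpacks tuples the same way `pool.starmap` does but runs one call at a time. `describe` is a made-up stand-in for `copy_unzip_clean()`, and the file names and destination are placeholders.

```python
from itertools import repeat, starmap

def describe(f, destination):
    '''Stand-in for copy_unzip_clean(): just reports what it would do.'''
    return '{} -> {}'.format(f, destination)

demo_files = ['a.json.bz2', 'b.json.bz2']   # hypothetical file names
demo_args = zip(demo_files, repeat('/scratch/netid/olympus_local/demo'))

# Each tuple in demo_args is unpacked into (f, destination).
results = list(starmap(describe, demo_args))

# zip() returns a one-shot iterator: after one pass it is empty.
leftover = list(demo_args)

print(results)
print(leftover)
```

The same applies to `args` above: if you re-run the `Pool` cell, re-run the cell that builds `args` first, or the pool will have nothing to do.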

Which files contained a subset of corrupt JSON?

In [17]:
glob.glob(os.path.join(collection_local, '*.dirty'))

[]

# Conclusion
Within this notebook, you saw some tricks and best practices.<br>
Aside from using this environment to develop new ideas and explore data, notebooks encourage reproducibility.


Notebooks can also be transitioned to a .py script with ease...<br>
This notebook exists as a script [here](https://github.com/SMAPPNYU/smapputil/blob/master/py/olympus_2_scratch/olympus2scratch.py), and can be implemented as follows:

```python olympus2scratch.py -c <collection_name> -n <number_of_cpus>```

This is great if you want to transport a large collection in an [sbatch](https://wikis.nyu.edu/display/NYUHPC/Slurm+Tutorial) job!<br>
_You can let it run for hours in the background, and scale up the cpus_

Now that you have data in `/scratch/$USER/`, do some analysis!

Do you want to learn more about data wrangling, visualization, or machine learning?<br>
There are plans in the works to accommodate this.<br>
Please send requests through the Google group, email, or in person :)

### Some other cool things:
We can switch Python versions in a Jupyter Notebook, by adding new Kernels.<br>
We can also add an [R-Kernel](https://irkernel.github.io/), which is pretty rad!