# Context


You receive a big text file every 5 minutes. You need to process its content, but your processing script is too slow and your backlog keeps growing. You could rewrite the processing script, but that would be a lot of work.

It turns out the machine you're using has lots of cores, and you could run your processing script in parallel... if you had multiple small files instead of one big one.

This file is text, so you can read it easily, but its content is made of multiline blocks, so you cannot just split it at an arbitrary line.

### Get the dataset

In [None]:
%%bash 

mkdir -p data

wget https://owncloud.rafiot.eu/s/gp2cn7trXXsae63/download -O data/bview.tar.gz

pushd data
tar xzf bview.tar.gz
popd

### Look at the content of the directory

In [None]:
%%bash

ls data 

## Step 1 - naive approach


Figure out a separator, then split the file into 7 independent files of roughly the same size.

Tools required (**look at the manpages**):
* `head`: look at the file -> find a separator
* `grep`: figure out how many entries we have
* `wc`: count the number of blocks
* `bc`: compute things -> number of blocks per file

In [None]:
%%bash

head <number of lines> data/bview.20030809.1600.txt

In [None]:
%%bash

grep <pattern> data/bview.20030809.1600.txt | wc <lines>

#### Compute that in a pythonic way

In [None]:
int(<number>/7)
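As a small sketch with a made-up block count (the real number comes from the `grep | wc` cell above), floor division with `//` gives the same result as `int(n / 7)` for positive counts, and `divmod` also shows how many blocks are left over:

```python
# Hypothetical block count, for illustration only; use the number you measured.
total_blocks = 1000

# Floor division keeps the result an integer.
blocks_per_file = total_blocks // 7

# divmod returns the quotient and the remainder in one call.
quotient, leftover = divmod(total_blocks, 7)

print(blocks_per_file, quotient, leftover)
```

The leftover blocks have to end up somewhere, typically in the last file or in one extra file.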

#### Quick & dirty bash magic

In [None]:
%%bash

ENTRIES=`grep <pattern> data/bview.20030809.1600.txt | wc <lines>`

echo "${ENTRIES}/7"| bc

## Now, split that file!

Documentation: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

1. Open the file
2. Iterate over it line by line and store the content in a temporary variable
3. Keep track of the number of blocks you currently have in the temporary variable
4. When the number is reached, store the content of the temporary variable in a new file (prepend `split_` to the file name)
5. Repeat until you reach the end of the file
6. Close the file when done
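The steps above can be sketched on a tiny generated file, so you can see the moving parts before running them on the real dump. This is only one possible shape for the loop, not the reference solution: the sample content, the temporary directory, the helper `write_split` and the value of `max_blocks_in_file` are all made up for the demo, and it assumes that an empty line closes each block.

```python
import os
import tempfile

# Tiny stand-in for the real dump: 5 two-line blocks, each followed by an empty line.
sample = ''.join(f'ENTRY {i}\nfield: value\n\n' for i in range(5))

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, 'sample.txt')
with open(source_path, 'w') as f:
    f.write(sample)

separator = '\n'        # assumption: an empty line closes a block
max_blocks_in_file = 2  # arbitrary value for this demo
written = []            # paths of the files created, in order


def write_split(content, number):
    """Hypothetical helper: store content in split_<number>.txt."""
    out_path = os.path.join(workdir, f'split_{number}.txt')
    out = open(out_path, 'w')
    out.write(content)
    out.close()
    written.append(out_path)


file_number = 1
blocks_in_file = 0
new_file_content = ''

source = open(source_path, 'r')                # 1. open the file
for line in source:                            # 2. iterate line by line
    new_file_content += line
    if line == separator:                      # 3. one more complete block
        blocks_in_file += 1
    if blocks_in_file == max_blocks_in_file:   # 4. limit reached: new file
        write_split(new_file_content, file_number)
        file_number += 1
        blocks_in_file = 0
        new_file_content = ''
if new_file_content:                           # 5. leftover blocks at the end
    write_split(new_file_content, file_number)
source.close()                                 # 6. close the file

print(len(written))
```

Note the check after the loop: without it, the blocks that do not fill a complete file would be silently dropped.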

In order to make sure the code you wrote is correct, run the following code:

In [None]:
import glob
import hashlib

with open('data/bview.20030809.1600.txt', 'rb') as f:
    hash_source = hashlib.sha256(f.read()).hexdigest()
print(hash_source)

hash_dest = hashlib.sha256()
for out_file in sorted(glob.glob('split_*.txt')):
    with open(out_file, 'rb') as f:
        hash_dest.update(f.read())
dest = hash_dest.hexdigest()
print(dest)

Ignore this piece of code for now, we'll get back to it later. You only need to know the following: **If the two values it prints are the same, your code is correct.**
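If you are curious before then, here is a minimal sketch of the idea behind the validator, using made-up data: feeding a hash object piece by piece with `update()` gives the same digest as hashing everything at once, but only if the pieces arrive in the original order.

```python
import hashlib

# Made-up content standing in for the source file.
data = b'block one\n\nblock two\n\nblock three\n\n'

# Hash of the whole content in one go.
whole = hashlib.sha256(data).hexdigest()

# The same content cut in three pieces, hashed incrementally in order.
pieces = [data[:11], data[11:22], data[22:]]
incremental = hashlib.sha256()
for piece in pieces:
    incremental.update(piece)

# The same pieces, fed in the wrong order.
shuffled = hashlib.sha256()
for piece in reversed(pieces):
    shuffled.update(piece)

print(whole == incremental.hexdigest())
print(whole == shuffled.hexdigest())
```

This is why the validator sorts the output files before reading them.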

### If they're not, concatenate them all again

In [None]:
%%bash 

cat split_*.txt > foo.txt

### Check the difference & fix your code until the validator script prints the same value twice

In [None]:
%%bash

diff data/bview.20030809.1600.txt foo.txt

## Step 2 - Function

Make it a function with the following parameters: `source_file_name`, `separator`, `output_name`.

Documentation: https://www.w3schools.com/python/python_functions.asp

The call to your function should look like the following snippet. Run it, and check with the validator snippet above that your code still works as expected.

In [None]:
file_splitter(source_file_name='data/bview.20030809.1600.txt', separator='\n', output_name='split')

## Step 3 - Dynamically compute the number of blocks


What if the file gets a lot bigger? Or its size fluctuates?

    i.e. we need to dynamically figure out how many blocks we want in each file

Or we want to split it in more/less files?

    i.e. we have more CPUs at hand and can process more files at once

* Use the python `re` module to do the same as `grep` but in python: https://docs.python.org/3/library/re.html#re.findall
* Use the built-in function `len()` to figure out the number of blocks you have in the file: https://docs.python.org/3/library/functions.html#len
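A minimal sketch of the `re.findall` + `len` combination on a made-up string. The pattern `^TIME:` is only an assumption for this sample; use whatever pattern you picked for `grep` earlier.

```python
import re

# Made-up sample: three blocks, each closed by an empty line.
sample = 'TIME: a\nTYPE: x\n\nTIME: b\nTYPE: y\n\nTIME: c\nTYPE: z\n\n'

# Hypothetical pattern: a line starting with "TIME:" opens each block.
# re.MULTILINE makes ^ match at the start of every line, like grep does.
matches = re.findall(r'^TIME:', sample, flags=re.MULTILINE)

# findall returns a list of matches, so len gives the number of blocks.
nb_blocks = len(matches)
print(nb_blocks)  # → 3
```

Without `re.MULTILINE`, `^` only matches at the very beginning of the string and the count would be 1.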

### TODO

1. Write a function that returns the number of blocks (see the call below)
2. Update the `file_splitter` function accordingly (see the call below)

In [None]:
original = 'data/bview.20030809.1600.txt'

total_blocs_in_file = entries_counter(source_file=original)

file_splitter(source_file_name=original, max_blocs_in_file=total_blocs_in_file/7, separator='\n', output_name='split')

## Step 4 - Improve readability of the header


Do we care about the number of entries? Or the number of files?

### TODO

* Update your code so that the number of files can be passed as a parameter

In [None]:
file_splitter(source_file_name='data/bview.20030809.1600.txt', number_of_files=7, separator='\n', output_name='split')

## Step 5 - Pythonic code is better


We're getting there. Let's do some refactoring now to make the code more pythonic.

### TODO

1. Use the `with open ... as ...:` syntax when possible: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
2. Use string formatting instead of concatenating text: https://docs.python.org/3/reference/lexical_analysis.html#f-strings or https://docs.python.org/3.3/library/string.html#format-examples
3. Use `round` on entries_per_file: https://docs.python.org/3/library/functions.html#round
4. Add some logging (see the `logging` module): https://docs.python.org/3/howto/logging.html#logging-basic-tutorial
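A short sketch of points 2 and 3 with made-up numbers. The names `total_blocks` and `entries_per_file` are only used for this illustration.

```python
# Hypothetical values, for illustration only.
total_blocks = 10
number_of_files = 3

# Plain division gives a float.
exact = total_blocks / number_of_files

# round() with a single argument returns an int.
entries_per_file = round(exact)

# f-string: expressions go inside {}, with an optional format spec after ":".
message = f'{entries_per_file} blocks per file (exact value: {exact:.2f}).'
print(message)

# Python rounds halves to the nearest even number.
print(round(2.5), round(3.5))
```

Because the rounded value rarely divides the total exactly, the last file usually ends up with a few more or a few fewer blocks than the others.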

In [None]:
import logging
import re

def entries_counter(source_file):
    # TODO: count the blocks in source_file (hint: re.findall and len)
    nb_blocks = 0  # placeholder so that the skeleton runs
    logging.debug(f'{nb_blocks} blocks in the file "{source_file}".')
    return nb_blocks

def file_splitter(source_file_name, number_of_files, separator='\n', output_name='split'):
    logging.info(f'Start to split "{source_file_name}" in {number_of_files} files.')
    total_blocs_in_file = entries_counter(source_file=source_file_name)
    # TODO: round the number of blocks per file
    max_blocs_in_file = 0  # placeholder so that the skeleton runs
    logging.debug(f'{max_blocs_in_file} blocks per file.')

    with open(source_file_name, 'r') as original_file:
        file_number = 1
        blocs_in_file = 0
        new_file_content = ''

        # Loop through the file, line by line
        for line in original_file:
            pass  # TODO: do the work
    logging.info(f'Done with "{source_file_name}".')

In [None]:
import logging

logging.basicConfig(level=logging.DEBUG)

file_splitter(source_file_name='data/bview.20030809.1600.txt', number_of_files=7, separator='\n', output_name='split')

## Step 5 - Bonus

What happens if you split the file into more than 9 files and try to validate? Run the following code, then try to validate. If it doesn't work, you probably want to look at this: https://stackoverflow.com/questions/339007/nicest-way-to-pad-zeroes-to-a-string

In [None]:
import logging

logging.basicConfig(level=logging.DEBUG)

file_splitter(source_file_name='data/bview.20030809.1600.txt', number_of_files=100, separator='\n', output_name='split')
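A minimal sketch of what goes wrong, assuming your output files are named `split_<number>.txt`: the validator reads the files in `sorted()` order, and file names are sorted as text, not as numbers. The width of 3 digits is an arbitrary choice that is enough for up to 999 files.

```python
# File names without padding, as a naive splitter would produce them.
names = [f'split_{i}.txt' for i in range(1, 12)]
print(sorted(names)[:3])   # → ['split_1.txt', 'split_10.txt', 'split_11.txt']

# Same names with the number padded with zeros to 3 digits.
padded = [f'split_{i:03}.txt' for i in range(1, 12)]
print(sorted(padded)[:3])  # → ['split_001.txt', 'split_002.txt', 'split_003.txt']
```

With padded numbers, the alphabetical order and the numerical order are the same, so the validator reads the pieces in the order they were written.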