# Context


You have a big file and you want to extract information from it and correlate that information
with third-party services. You get a new file every 5 minutes.

Processing all of that in a single process takes too long, so the plan is to split the file into smaller files that can be processed independently.

The file is plain text, so it is easy to read, but its content is made of multiline blocks.

Use the `validate.sh` script to make sure the files you generate are the same as the source files.

## Step 0


Clone this repo:

```bash
git clone https://github.com/Rafiot/2019-metz.git
```

In [None]:
%%bash 

mkdir -p data

wget https://owncloud.rafiot.eu/s/gp2cn7trXXsae63/download -O data/bview.tar.gz

pushd data
tar xzf bview.tar.gz
popd

In [None]:
%%bash

ls data 

## Step 1 - naive approach


Figure out the separator, then write code that splits the file into 7 independent files of roughly the same size.

Tools required:
* `vim` or `head` (look at the file -> find a separator)
* `grep` (figure out how many entries we have)
* `wc` (count the number of blocks)
* `bc` (compute the number of blocks per file)

In [None]:
%%bash

head -n 20 data/bview.20030809.1600.txt

In [None]:
%%bash

grep '^$' data/bview.20030809.1600.txt | wc -l 

In [None]:
%%bash

ENTRIES=`grep '^$' data/bview.20030809.1600.txt | wc -l`

echo "${ENTRIES}/7"| bc

In [None]:
original_file = open('data/bview.20030809.1600.txt', 'r')

file_number = 1
blocs_in_file = 0
new_file_content = ''

# Loop through the file, line by line
for line in original_file:
    # Store the line in a temporary variable
    new_file_content += line
    if line == '\n':
        # Count the blocks
        blocs_in_file += 1
    if blocs_in_file > 193101:
        # If we reach the limit, write the content of the temporary variable in the new file
        new_file = open('split_' + str(file_number) + '.txt', 'w')
        new_file.write(new_file_content)
        new_file.close()
        # Reset counters
        file_number += 1
        blocs_in_file = 0
        new_file_content = ''
original_file.close()

## Validator

In [None]:
import glob
import hashlib

with open('data/bview.20030809.1600.txt', 'rb') as f:
    hash_source = hashlib.sha256(f.read()).hexdigest()
print(hash_source)

hash_dest = hashlib.sha256()
for out_file in sorted(glob.glob('split_*.txt')):
    with open(out_file, 'rb') as f:
        hash_dest.update(f.read())
dest = hash_dest.hexdigest()
print(dest)
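
The validator above reads every file fully into memory before hashing it. Here is a minimal sketch of the same check that reads the files in chunks instead. The helper names `sha256_of_files` and `splits_match_source` are made up for this example, and it assumes the split files still match `split_*.txt`:

```python
import glob
import hashlib


def sha256_of_files(paths, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the given files, read in order."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, 'rb') as f:
            # Read fixed-size chunks so a large file never sits fully in memory
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
    return digest.hexdigest()


def splits_match_source(source, pattern='split_*.txt'):
    """True if the sorted split files, concatenated, equal the source file."""
    parts = sorted(glob.glob(pattern))
    return sha256_of_files([source]) == sha256_of_files(parts)
```

Calling `splits_match_source('data/bview.20030809.1600.txt')` should return `True` only when the split files, taken in sorted order, reproduce the source byte for byte.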

## Concatenate

In [None]:
%%bash 

cat split_*.txt > foo.txt

## Diff

In [None]:
%%bash

diff data/bview.20030809.1600.txt foo.txt

## Final solution - naive approach

In [None]:
original_file = open('data/bview.20030809.1600.txt', 'r')

file_number = 1
blocs_in_file = 0
new_file_content = ''

# Loop through the file, line by line
for line in original_file:
    # Store the line in a temporary variable
    new_file_content += line
    if line == '\n':
        # Count the blocks
        blocs_in_file += 1
    if blocs_in_file > 193101:
        # If we reach the limit, write the content of the temporary variable in the new file
        new_file = open('split_' + str(file_number) + '.txt', 'w')
        new_file.write(new_file_content)
        new_file.close()
        # Reset counters
        file_number += 1
        blocs_in_file = 0
        new_file_content = ''
else:
    # The `else` clause of a `for` loop runs when the loop ends without `break`:
    # EOF reached, write everything left in the temporary variable
    new_file = open('split_' + str(file_number) + '.txt', 'w')
    new_file.write(new_file_content)
    new_file.close()

original_file.close()

## Step 2 - Function

Make it a function with the following parameters: `source_file_name`, `separator`, `output_name`


In [None]:
def file_splitter(source_file_name, separator='\n', output_name='split'):
    original_file = open(source_file_name, 'r')

    file_number = 1
    blocs_in_file = 0
    new_file_content = ''

    # Loop through the file, line by line
    for line in original_file:
        # Store the line in a temporary variable
        new_file_content += line
        if line == separator:
            # Count the blocks
            blocs_in_file += 1
        if blocs_in_file > 193101:
            # If we reach the limit, write the content of the temporary variable in the new file
            new_file = open(output_name + '_' + str(file_number) + '.txt', 'w')
            new_file.write(new_file_content)
            new_file.close()
            # Reset counters
            file_number += 1
            blocs_in_file = 0
            new_file_content = ''
    else:
        # EOF reached, writing everything we have in the temporary variable
        new_file = open(output_name + '_' + str(file_number) + '.txt', 'w')
        new_file.write(new_file_content)
        new_file.close()

    original_file.close()

In [None]:
file_splitter(source_file_name='data/bview.20030809.1600.txt', separator='\n', output_name='split')

## Step 3 - Dynamically compute the number of blocks


What if the file gets a lot bigger, or its size fluctuates?
    (i.e. we need to dynamically figure out how many blocks we want in each file)

Or what if we want to split it into more or fewer files?
    (i.e. we have more CPUs at hand and can process more files at once)

Python modules:
* `re` (regex, replaces `grep`)

Methods:
* `len` (replaces `wc`)

1. Count the total number of blocks (in another function)
2. Divide it by the number of files
3. Update the `file_splitter` function accordingly
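
The division in step 2 rarely comes out even, so it helps to decide up front how to round. The sketch below rounds up, so that the requested number of files is always enough to hold every block. `blocks_per_file` is a name chosen for this example, and how the result is compared with the running counter in the splitter is a separate choice:

```python
def blocks_per_file(total_blocks, number_of_files):
    """Smallest per-file block count such that number_of_files files hold everything."""
    if number_of_files < 1:
        raise ValueError('number_of_files must be at least 1')
    # Integer ceiling division: round up without going through floats
    return (total_blocks + number_of_files - 1) // number_of_files
```

For instance, 10 blocks spread over 3 files gives 4 blocks per file, and the last file ends up with only 2.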


### 3.1 Compute the number of blocks



In [None]:
import re

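def entries_counter(source_file):
    with open(source_file, 'r') as f:
        content = f.read()
    # Count the empty lines; '^$' would also match after the final newline of the file
    matches = re.findall(r'^\n', content, flags=re.MULTILINE)
    return len(matches)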

In [None]:
total_blocs_in_file = entries_counter(source_file='data/bview.20030809.1600.txt')
print(total_blocs_in_file)

### 3.2 Update file_splitter

Add a parameter `max_blocs_in_file`

In [None]:
def file_splitter(source_file_name, max_blocs_in_file, separator='\n', output_name='split'):
    original_file = open(source_file_name, 'r')

    file_number = 1
    blocs_in_file = 0
    new_file_content = ''

    # Loop through the file, line by line
    for line in original_file:
        # Store the line in a temporary variable
        new_file_content += line
        if line == separator:
            # Count the blocks
            blocs_in_file += 1
        if blocs_in_file > max_blocs_in_file:
            # If we reach the limit, write the content of the temporary variable in the new file
            new_file = open(output_name + '_' + str(file_number) + '.txt', 'w')
            new_file.write(new_file_content)
            new_file.close()
            # Reset counters
            file_number += 1
            blocs_in_file = 0
            new_file_content = ''
    else:
        # EOF reached, writing everything we have in the temporary variable
        new_file = open(output_name + '_' + str(file_number) + '.txt', 'w')
        new_file.write(new_file_content)
        new_file.close()

    original_file.close()

### 3.3 Put everything together

In [None]:
original = 'data/bview.20030809.1600.txt'

total_blocs_in_file = entries_counter(source_file=original)
file_splitter(source_file_name=original, max_blocs_in_file=total_blocs_in_file/7, separator='\n', output_name='split')

## Step 4


Do we care about the number of entries, or about the number of files?

===> Update your code so that the number of files can be passed as a parameter

In [None]:
def file_splitter(source_file_name, number_of_files, separator='\n', output_name='split'):
    total_blocs_in_file = entries_counter(source_file=source_file_name)
    max_blocs_in_file = total_blocs_in_file/number_of_files
    
    original_file = open(source_file_name, 'r')

    file_number = 1
    blocs_in_file = 0
    new_file_content = ''

    # Loop through the file, line by line
    for line in original_file:
        # Store the line in a temporary variable
        new_file_content += line
        if line == separator:
            # Count the blocks
            blocs_in_file += 1
        if blocs_in_file > max_blocs_in_file:
            # If we reach the limit, write the content of the temporary variable in the new file
            new_file = open(output_name + '_' + str(file_number) + '.txt', 'w')
            new_file.write(new_file_content)
            new_file.close()
            # Reset counters
            file_number += 1
            blocs_in_file = 0
            new_file_content = ''
    else:
        # EOF reached, writing everything we have in the temporary variable
        new_file = open(output_name + '_' + str(file_number) + '.txt', 'w')
        new_file.write(new_file_content)
        new_file.close()

    original_file.close()

In [None]:
file_splitter(source_file_name='data/bview.20030809.1600.txt', number_of_files=7, separator='\n', output_name='split')

## Step 5


We're getting there. Let's do some refactoring now to make the code more Pythonic.

* Use the `with open(...) as ...:` syntax when possible
* Use string formatting instead of concatenating text
* Use `round` on `max_blocs_in_file`
* Add some logging (see the `logging` module)

In [None]:
import logging
import re

def entries_counter(source_file):
    with open(source_file, 'r') as f:
        content = f.read()
    # Count the empty lines; '^$' would also match after the final newline of the file
    nb_blocks = len(re.findall(r'^\n', content, flags=re.MULTILINE))
    logging.debug(f'{nb_blocks} blocks in the file "{source_file}".')
    return nb_blocks

def file_splitter(source_file_name, number_of_files, separator='\n', output_name='split'):
    logging.info(f'Start to split "{source_file_name}" in {number_of_files} files.')
    total_blocs_in_file = entries_counter(source_file=source_file_name)
    # round() without a second argument returns an int
    max_blocs_in_file = round(total_blocs_in_file / number_of_files)
    logging.debug(f'{max_blocs_in_file} blocks per file.')
    
    with open(source_file_name, 'r') as original_file:
        file_number = 1
        blocs_in_file = 0
        new_file_content = ''

        # Loop through the file, line by line
        for line in original_file:
            # Store the line in a temporary variable
            new_file_content += line
            if line == separator:
                # Count the blocks
                blocs_in_file += 1
            if blocs_in_file > max_blocs_in_file:
                logging.debug(f'Writing {output_name}_{file_number}.txt.')
                # If we reach the limit, write the content of the temporary variable in the new file
                with open(f'{output_name}_{file_number}.txt', 'w') as new_file:
                    new_file.write(new_file_content)
                # Reset counters
                file_number += 1
                blocs_in_file = 0
                new_file_content = ''
        else:
            logging.debug(f'Writing {output_name}_{file_number}.txt.')
            # EOF reached, writing everything we have in the temporary variable
            with open(f'{output_name}_{file_number}.txt', 'w') as new_file:
                new_file.write(new_file_content)
    logging.info(f'Done with "{source_file_name}".')

In [None]:
import logging

logging.basicConfig(level=logging.DEBUG)

file_splitter(source_file_name='data/bview.20030809.1600.txt', number_of_files=7, separator='\n', output_name='split')

## Step 5 - Bonus

What happens if you split the file into more than 9 files and then try to validate?

In [None]:
import logging

logging.basicConfig(level=logging.DEBUG)

file_splitter(source_file_name='data/bview.20030809.1600.txt', number_of_files=100, separator='\n', output_name='split')
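
The validator concatenates the files in `sorted()` order, and that order is lexicographic, not numeric. The short example below shows the effect on file names like the ones generated here, and how zero-padding the number restores the expected order:

```python
# File names as produced without padding
names = [f'split_{n}.txt' for n in range(1, 12)]
print(sorted(names)[:3])
# ['split_1.txt', 'split_10.txt', 'split_11.txt']

# Same names with the number padded to two digits
padded = [f'split_{n:02}.txt' for n in range(1, 12)]
print(sorted(padded)[:3])
# ['split_01.txt', 'split_02.txt', 'split_03.txt']
```

With unpadded names, `split_10.txt` is read before `split_2.txt`, so the concatenated content no longer matches the source even though every individual file is correct.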

In [None]:
import logging
import re

def entries_counter(source_file):
    with open(source_file, 'r') as f:
        content = f.read()
    # Count the empty lines; '^$' would also match after the final newline of the file
    nb_blocks = len(re.findall(r'^\n', content, flags=re.MULTILINE))
    logging.debug(f'{nb_blocks} blocks in the file "{source_file}".')
    return nb_blocks

def file_splitter(source_file_name, number_of_files, separator='\n', output_name='split'):
    logging.info(f'Start to split "{source_file_name}" in {number_of_files} files.')
    total_blocs_in_file = entries_counter(source_file=source_file_name)
    # round() without a second argument returns an int
    max_blocs_in_file = round(total_blocs_in_file / number_of_files)
    logging.debug(f'{max_blocs_in_file} blocks per file.')
    
    padding_length = len(str(number_of_files))
    
    with open(source_file_name, 'r') as original_file:
        file_number = 1
        blocs_in_file = 0
        new_file_content = ''

        # Loop through the file, line by line
        for line in original_file:
            # Store the line in a temporary variable
            new_file_content += line
            if line == separator:
                # Count the blocks
                blocs_in_file += 1
            if blocs_in_file > max_blocs_in_file:
                logging.debug(f'Writing {output_name}_{file_number:0{padding_length}}.txt.')
                # If we reach the limit, write the content of the temporary variable in the new file
                with open(f'{output_name}_{file_number:0{padding_length}}.txt', 'w') as new_file:
                    new_file.write(new_file_content)
                # Reset counters
                file_number += 1
                blocs_in_file = 0
                new_file_content = ''
        else:
            logging.debug(f'Writing {output_name}_{file_number:0{padding_length}}.txt.')
            # EOF reached, writing everything we have in the temporary variable
            with open(f'{output_name}_{file_number:0{padding_length}}.txt', 'w') as new_file:
                new_file.write(new_file_content)
    logging.info(f'Done with "{source_file_name}".')

## Step 6

Let's think a bit about how we can make this code more efficient.

Why do we compute the number of entries? Do we need that? What about using the size of the file instead?

Methods:
* `file.seek`
* `file.tell`
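
One possible way to use these two methods, as a sketch and not the only solution: jump with `seek` to roughly where each cut should be, read forward to the next empty line, and record the position with `tell`. The function name `find_split_offsets` is an assumption of this example, and it opens the file in binary mode so that offsets are plain byte counts:

```python
import os


def find_split_offsets(source_file_name, number_of_files, separator=b'\n'):
    """Return byte offsets where the file can be cut without breaking a block."""
    file_size = os.path.getsize(source_file_name)
    target_size = file_size // number_of_files
    offsets = []
    with open(source_file_name, 'rb') as f:
        for i in range(1, number_of_files):
            # Jump close to where the i-th cut should be
            f.seek(i * target_size)
            # Drop the (probably partial) line we landed in
            f.readline()
            # Move forward until just after the next separator line
            for line in iter(f.readline, b''):
                if line == separator:
                    break
            position = f.tell()
            # Skip cuts at EOF and cuts that repeat the previous one
            if position < file_size and (not offsets or position > offsets[-1]):
                offsets.append(position)
    return offsets
```

The file is never read in full: only a few lines around each target position are touched, so counting the entries beforehand is no longer needed. The parts are only roughly the same size, since each cut moves forward to the end of the current block.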

## Step 7


Let's make it better:
* Only open the source file once
* Open as binary file
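
A possible shape for this step, as a sketch under a few assumptions: the source is opened a single time in binary mode, each part starts with one bulk `read` of about `size / number_of_files` bytes, and the part is then extended line by line until the next empty line. The name `split_by_size` and the returned list of file names are choices made for this example:

```python
import os


def split_by_size(source_file_name, number_of_files, separator=b'\n', output_name='split'):
    """Split a file into at most number_of_files parts, cutting only after a separator line."""
    # +1 guarantees that number_of_files parts are enough to cover the whole file
    target_size = os.path.getsize(source_file_name) // number_of_files + 1
    padding_length = len(str(number_of_files))
    written = []
    file_number = 1
    with open(source_file_name, 'rb') as source:
        while True:
            # Bulk read of roughly one part; it may stop in the middle of a block
            chunk = source.read(target_size)
            if not chunk:
                break
            name = f'{output_name}_{file_number:0{padding_length}}.txt'
            with open(name, 'wb') as part:
                part.write(chunk)
                # Finish the line the bulk read stopped in
                part.write(source.readline())
                # Then copy whole lines until the block is closed by a separator
                for line in iter(source.readline, b''):
                    part.write(line)
                    if line == separator:
                        break
            written.append(name)
            file_number += 1
    return written
```

Working on bytes avoids decoding the whole file, and the output can still be checked with the validator since the parts are written in binary mode too. Each bulk read holds about one part in memory, which may or may not be acceptable depending on the file size.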

## Step 8


* Fetch new files when there is something available
    * http://data.ris.ripe.net/rrc00/latest-bview.gz
    * ===> http://docs.python-requests.org/en/master/api/#requests.head & Last-Modified
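
A minimal sketch of the `Last-Modified` check, assuming the server sends that header on a `HEAD` request. The helper names `is_newer` and `remote_last_modified` are made up for this example, and `requests` is a third-party package that has to be installed separately:

```python
from email.utils import parsedate_to_datetime


def is_newer(last_modified_header, previous_header):
    """Compare two HTTP Last-Modified values; True if the first one is more recent."""
    if previous_header is None:
        # Nothing downloaded yet, so anything available is new
        return True
    return parsedate_to_datetime(last_modified_header) > parsedate_to_datetime(previous_header)


def remote_last_modified(url):
    """Return the Last-Modified header of url (None if absent) using a HEAD request."""
    import requests  # third-party; imported here so is_newer works without it
    response = requests.head(url, allow_redirects=True, timeout=10)
    response.raise_for_status()
    return response.headers.get('Last-Modified')
```

The idea is to keep the header value of the last downloaded file, call `remote_last_modified` on the URL above from time to time, and download the file again only when `is_newer` returns `True`.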

* Use the library to generate text files:
    * https://bitbucket.org/ripencc/bgpdump/downloads/ (Installation details: https://bitbucket.org/ripencc/bgpdump/wiki/Home.wiki#!building)

    ```
    sh ./bootstrap.sh
    make
    ./bgpdump -T

    # Convert the compressed dump to a text file
    ./bgpdump -O ../data/latest-bview.txt  ../data/original/latest-bview.gz
    ```

## Step 9 ++


If you're fast and bored:
* Make it a class (with comments)
* Yield pseudo files (`BytesIO`) instead of writing the files on the disk
* Use `argparse` to make the script more flexible
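
As a starting point for the last two items, here is a sketch of a generator that yields `BytesIO` pseudo files, together with a small `argparse` parser. The names `iter_pseudo_files` and `build_parser`, the `--blocks-per-file` option and its default value are all assumptions of this example:

```python
import argparse
from io import BytesIO


def iter_pseudo_files(source, blocks_per_file, separator=b'\n'):
    """Yield BytesIO objects holding up to blocks_per_file blocks each.

    source is any binary file-like object that can be iterated line by line.
    """
    buffer = BytesIO()
    blocks = 0
    for line in source:
        buffer.write(line)
        if line == separator:
            blocks += 1
        if blocks >= blocks_per_file:
            buffer.seek(0)  # rewind so the consumer reads from the start
            yield buffer
            buffer = BytesIO()
            blocks = 0
    if buffer.tell():
        # Leftover content after the last full group of blocks
        buffer.seek(0)
        yield buffer


def build_parser():
    """Command line options for a script built around iter_pseudo_files."""
    parser = argparse.ArgumentParser(description='Split a file made of multiline blocks.')
    parser.add_argument('source', help='path of the file to split')
    parser.add_argument('-b', '--blocks-per-file', type=int, default=1000,
                        help='number of blocks in each pseudo file')
    return parser
```

A script could call `build_parser().parse_args()`, open `args.source` in binary mode and loop over `iter_pseudo_files(f, args.blocks_per_file)`, handing each pseudo file to a worker without touching the disk.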

