In [1]:
%%javascript

IPython.keyboard_manager.command_shortcuts.add_shortcut('9', {
    help: 'Clear all output',               // This text will show up on the help page (CTRL-M h or ESC h)
    handler: function (event) {             // Function that gets invoked
        if (IPython.notebook.mode == 'command') {
            IPython.notebook.clear_all_output();
            return false;
        }
        return true;
    }
});

<IPython.core.display.Javascript object>

In [2]:
%%bash
rm *.py*

## Rationale for weather data munging

The goal of this exercise is to write a program that reads the `weather.dat` file and prints the day number and the minimum temperature for the day with the lowest minimum temperature in the month covered by the file.
The program should work like this:

    python weather.py
    9 32

The `weather.dat` file contains tabular, space-separated weather measurements for one month at a single location.
The file has a header line, followed by an empty line, one line of data for each day of the month, and a last line with the month's mean values for some of the columns.
Each data line contains the day of the month in the first column and the minimum temperature for that day in the third column.

The contents look like this:

In [None]:
# %load weather.dat
  Dy MxT   MnT   AvT   HDDay  AvDP 1HrP TPcpn WxType PDir AvSp Dir MxS SkyC MxR MnR AvSLP

   1  88    59    74          53.8       0.00 F       280  9.6 270  17  1.6  93 23 1004.5
   2  79    63    71          46.5       0.00         330  8.7 340  23  3.3  70 28 1004.5
   3  77    55    66          39.6       0.00         350  5.0 350   9  2.8  59 24 1016.8
   4  77    59    68          51.1       0.00         110  9.1 130  12  8.6  62 40 1021.1
   5  90    66    78          68.3       0.00 TFH     220  8.3 260  12  6.9  84 55 1014.4
   6  81    61    71          63.7       0.00 RFH     030  6.2 030  13  9.7  93 60 1012.7
   7  73    57    65          53.0       0.00 RF      050  9.5 050  17  5.3  90 48 1021.8
   8  75    54    65          50.0       0.00 FH      160  4.2 150  10  2.6  93 41 1026.3
   9  86    32*   59       6  61.5       0.00         240  7.6 220  12  6.0  78 46 1018.6
  10  84    64    74          57.5       0.00 F       210  6.6 050   9  3.4  84 40 1019.0
  11  91    59    75          66.3       0.00 H       250  7.1 230  12  2.5  93 45 1012.6
  12  88    73    81          68.7       0.00 RTH     250  8.1 270  21  7.9  94 51 1007.0
  13  70    59    65          55.0       0.00 H       150  3.0 150   8 10.0  83 59 1012.6
  14  61    59    60       5  55.9       0.00 RF      060  6.7 080   9 10.0  93 87 1008.6
  15  64    55    60       5  54.9       0.00 F       040  4.3 200   7  9.6  96 70 1006.1
  16  79    59    69          56.7       0.00 F       250  7.6 240  21  7.8  87 44 1007.0
  17  81    57    69          51.7       0.00 T       260  9.1 270  29* 5.2  90 34 1012.5
  18  82    52    67          52.6       0.00         230  4.0 190  12  5.0  93 34 1021.3
  19  81    61    71          58.9       0.00 H       250  5.2 230  12  5.3  87 44 1028.5
  20  84    57    71          58.9       0.00 FH      150  6.3 160  13  3.6  90 43 1032.5
  21  86    59    73          57.7       0.00 F       240  6.1 250  12  1.0  87 35 1030.7
  22  90    64    77          61.1       0.00 H       250  6.4 230   9  0.2  78 38 1026.4
  23  90    68    79          63.1       0.00 H       240  8.3 230  12  0.2  68 42 1021.3
  24  90    77    84          67.5       0.00 H       350  8.5 010  14  6.9  74 48 1018.2
  25  90    72    81          61.3       0.00         190  4.9 230   9  5.6  81 29 1019.6
  26  97*   64    81          70.4       0.00 H       050  5.1 200  12  4.0 107 45 1014.9
  27  91    72    82          69.7       0.00 RTH     250 12.1 230  17  7.1  90 47 1009.0
  28  84    68    76          65.6       0.00 RTFH    280  7.6 340  16  7.0 100 51 1011.0
  29  88    66    77          59.7       0.00         040  5.4 020   9  5.3  84 33 1020.6
  30  90    45    68          63.6       0.00 H       240  6.0 220  17  4.8 200 41 1022.7
  mo  82.9  60.5  71.7    16  58.8       0.00              6.9          5.3


## Bootstrapping code base

To start with the code, following Test Driven Development (TDD), we'll first create the `test_weather.py` file with the following contents:

```python
import weather

def test_process_weather():
    weather.process()
```

In [3]:
%%writefile test_weather.py
import weather

def test_process_weather():
    weather.process()

Writing test_weather.py


Once the file is created, running the tests should fail with an error, since there is no `weather` module yet:

In [4]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 0 items / 1 errors

________________________________________________________________________________________ ERROR collecting test_weather.py _________________________________________________________________________________________
test_weather.py:1: in <module>
    import weather
E   ImportError: No module named 'weather'


So the next step is to create a trivial module, which does nothing at all:

```python
def process():
    pass
```

In [5]:
%%writefile weather.py
def process():
    pass

Writing weather.py


With the `weather.py` module created, the test should now pass:

In [6]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



## Data reading iteration

In this first iteration, we'll focus on loading the data lines from the file.
The approach chosen is pretty simple: just read the lines from the file and print them all.

So for the first test, we'll check that the function, when called, prints the content of the file with surrounding whitespace stripped.
To check the output of the script with py.test, we'll use the built-in `capsys` fixture, which captures standard output and standard error in memory and lets the test inspect them afterwards:

```python
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("Dy")
```
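To see what capturing standard output amounts to, here is a minimal sketch that works outside py.test, using the standard library's `contextlib.redirect_stdout` instead of `capsys`. The `fake_process` function is a hypothetical stand-in for `weather.process`, and this is not how `capsys` is implemented, just the same idea in plain Python:

```python
import io
from contextlib import redirect_stdout


def fake_process():
    # Hypothetical stand-in for weather.process(): it only prints.
    print("Dy MxT MnT")


buffer = io.StringIO()
with redirect_stdout(buffer):
    # Anything printed inside this block is stored in `buffer`.
    fake_process()

# The captured text can be inspected after the call, like `out` in the test.
captured = buffer.getvalue()
```

As with `capsys.readouterr()`, the captured text keeps the trailing newline added by `print`.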

In [7]:
%%writefile test_weather.py
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("Dy")

Overwriting test_weather.py


With the test written, let's check that it fails for now:

In [8]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py F

______________________________________________________________________________________________ test_process_weather _______________________________________________________________________________________________

capsys = <_pytest.capture.CaptureFixture object at 0x104849940>

    def test_process_weather(capsys):
        weather.process()
        out, err = capsys.readouterr()
>       assert out.startswith("Dy")
E       assert <built-in method startswith of str object at 0x1003acab0>('Dy')
E        +  where <built-in method startswith of str object at 0x1003acab0> = ''.startswith

test_weather.py:6: Asse

When doing TDD, we should never write new tests or assertions while previous ones are still failing.
Also, a strict TDD approach expects the code to be just simple enough to satisfy the tests.
So one possible way to make this test pass is the following:

```python
def process():
    print("Dy")
```

In [9]:
%%writefile weather.py
def process():
    print("Dy")

Overwriting weather.py


This silly code does indeed pass the test:

In [10]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



So, following such a strict TDD approach, the next step is to add a new test or assertion that enforces more accurate results.
To illustrate this, the following change in the test asserts that the output contains more than one line:

```python
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("Dy")
    assert len(out.split("\n")) > 2
```

In [11]:
%%writefile test_weather.py
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("Dy")
    assert len(out.split("\n")) > 2

Overwriting test_weather.py


So, now the tests will be red again:

In [12]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py F

______________________________________________________________________________________________ test_process_weather _______________________________________________________________________________________________

capsys = <_pytest.capture.CaptureFixture object at 0x1013aa4e0>

    def test_process_weather(capsys):
        weather.process()
        out, err = capsys.readouterr()
        assert out.startswith("Dy")
>       assert len(out.split("\n")) > 2
E       assert 2 > 2
E        +  where 2 = len(['Dy', ''])
E        +    where ['Dy', ''] = <built-in method split of str object at 0x1049b9a40>('\n')
E   

Of course, to make this new assertion pass, we could just add more lines to the output, but following that path would make this iteration unnecessarily long, so we'll go for a more straightforward solution:

```python
def process():
    file = open('weather.dat')
    print("".join(file.readlines()).strip())
    file.close()
```

Here, we're using the built-in `open` to open the file, `readlines` to read all of its lines, `join` to concatenate them with an empty string, and `strip` to remove whitespace and newline characters from both ends of the resulting text before printing it.
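The effect of calling `strip` on the joined text, rather than on every line, can be checked with a small sketch. It uses `io.StringIO` as an in-memory stand-in for `weather.dat`, which is an assumption made so the example runs without the data file:

```python
import io

# In-memory stand-in for the first lines of weather.dat.
sample = io.StringIO("  Dy MxT   MnT\n\n   1  88    59\n")

text = "".join(sample.readlines()).strip()

# Only the outer whitespace is gone; inner lines keep their indentation.
lines = text.split("\n")
```

This is why the output starts with `Dy` and still contains more than two lines, which is all the current assertions require.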

In [13]:
%%writefile weather.py
def process():
    file = open('weather.dat')
    print("".join(file.readlines()).strip())
    file.close()

Overwriting weather.py


Now this code will pass the tests:

In [14]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



As the last requirement for this iteration, we want to get only the lines with data, not the header or empty lines:

```python
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("1")
    assert len(out.split("\n")) > 2
```

In [15]:
%%writefile test_weather.py
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("1")
    assert len(out.split("\n")) > 2

Overwriting test_weather.py


So the test fails again:

In [16]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py F

______________________________________________________________________________________________ test_process_weather _______________________________________________________________________________________________

capsys = <_pytest.capture.CaptureFixture object at 0x1048bec18>

    def test_process_weather(capsys):
        weather.process()
        out, err = capsys.readouterr()
>       assert out.startswith("1")
E       assert <built-in method startswith of str object at 0x105099e00>('1')
E        +  where <built-in method startswith of str object at 0x105099e00> = 'Dy MxT   MnT   AvT   HDDay  AvDP 1HrP T

Assuming that the header and empty lines could differ in other versions of the data file, we'll treat any line starting with a number (the day) as a data line.
To do that, we'll use the standard `re` module to match these lines for printing:

```python
import re

def process():
    file = open('weather.dat')
    pattern = r"[0-9]+.*"
    for line in file.readlines():
        match = re.match(pattern, line.strip())
        if match:
            print(line.strip())
    file.close()
```

Here we use `re.match` so that the pattern is applied from the beginning of the string, meaning the line must start with at least one digit.
The result will be `None`, which evaluates to `False`, if the pattern is not satisfied, which is what should happen with the header, empty, and footer lines.
For the rest of the lines, the data ones, it returns a match object, which evaluates to `True`.
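As a quick check of that behaviour, the sketch below applies the same pattern to four sample lines abridged from `weather.dat`. The `samples` list is illustrative, and the comparison with `re.search` is added only to show why anchoring at the start matters:

```python
import re

pattern = r"[0-9]+.*"

# Sample lines abridged from weather.dat: header, empty, data, footer.
samples = [
    "  Dy MxT   MnT   AvT",
    "",
    "   9  86    32*   59",
    "  mo  82.9  60.5  71.7",
]

# bool(match) is False for None and True for a match object.
results = [bool(re.match(pattern, line.strip())) for line in samples]

# re.search scans the whole string, so it would also accept the footer.
footer_search = bool(re.search(pattern, samples[3].strip()))
```

Only the data line is accepted by `re.match`, while `re.search` would let the footer through because it contains digits further along.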

In [17]:
%%writefile weather.py
import re

def process():
    file = open('weather.dat')
    pattern = r"[0-9]+.*"
    for line in file.readlines():
        match = re.match(pattern, line.strip())
        if match:
            print(line.strip())
    file.close()

Overwriting weather.py


This new code now passes the tests:

In [18]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



At this point, we could probably do some refactoring to improve the code that reads the file.
For this example, we chose to strictly avoid premature optimization, although, to be honest, it could probably be done safely now.
The possible optimizations are applied in the third iteration, as refactorings.

## Data munging iteration

The next step consists of extracting the desired values from each data line in the file:

```python
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("1")
    output_lines = out.split("\n")
    assert len(output_lines) > 2
    assert output_lines[0] == "1 59"
```

In [19]:
%%writefile test_weather.py
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out.startswith("1")
    output_lines = out.split("\n")
    assert len(output_lines) > 2
    assert output_lines[0] == "1 59"

Overwriting test_weather.py


So the test will be red again:

In [20]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py F

______________________________________________________________________________________________ test_process_weather _______________________________________________________________________________________________

capsys = <_pytest.capture.CaptureFixture object at 0x103dccb00>

    def test_process_weather(capsys):
        weather.process()
        out, err = capsys.readouterr()
        assert out.startswith("1")
        output_lines = out.split("\n")
        assert len(output_lines) > 2
>       assert output_lines[0] == "1 59"
E       assert '1  88    59 ... 93 23 1004.5' == '1 59'
E         - 1  88    59

One good way to get this is to change the regular expression so that it captures both the day and the minimum temperature:

```python
import re

def process():
    file = open('weather.dat')
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    for line in file.readlines():
        match = re.match(pattern, line.strip())
        if match:
            print("{} {}".format(match.group('day'), match.group('min')))
    file.close()
```

Here we're using parentheses to capture specific parts of each line, so the text matched by the expression inside each pair of parentheses can be retrieved from the resulting match object.
The groups are also named, so they can be accessed through these names.
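The sketch below applies the pattern to two lines abridged from the file to show how the named groups are read back. It also shows a side effect worth knowing about: a line whose maximum temperature carries an `*` (day 26 in the file) does not match, because `\s+` cannot consume the `*`. That does not change the result for this data set, since day 26 does not hold the lowest minimum:

```python
import re

pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"

# Day 9 from weather.dat, abridged; note the "*" after the minimum.
line = "   9  86    32*   59       6  61.5"

match = re.match(pattern, line.strip())
day = match.group("day")        # access by name
minimum = match.group("min")
by_index = match.group(1)       # groups stay available by position too
as_dict = match.groupdict()     # all named groups at once

# Day 26, abridged: the "*" follows the maximum, so the pattern fails.
day_26 = re.match(pattern, "26  97*   64    81")
```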

In [21]:
%%writefile weather.py
import re

def process():
    file = open('weather.dat')
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    for line in file.readlines():
        match = re.match(pattern, line.strip())
        if match:
            print("{} {}".format(match.group('day'), match.group('min')))
    file.close()

Overwriting weather.py


Now the tests will be passing:

In [22]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



Now we'll change the test to ensure it just prints the expected answer:

```python
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out == "9 32\n"
```

In [23]:
%%writefile test_weather.py
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out == "9 32\n"

Overwriting test_weather.py


Since the code still prints the first and third columns for all rows, this test will not pass:

In [24]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py F

______________________________________________________________________________________________ test_process_weather _______________________________________________________________________________________________

capsys = <_pytest.capture.CaptureFixture object at 0x10498c6d8>

    def test_process_weather(capsys):
        weather.process()
        out, err = capsys.readouterr()
>       assert out == "9 32\n"
E       assert '1 59\n2 63\n...9 66\n30 45\n' == '9 32\n'
E         - 1 59
E         - 2 63
E         - 3 55
E         - 4 59
E         - 5 66
E         - 6 61
E         - 7 57
E         - 8 54
E     

So the code now needs to be modified to satisfy this.
That can be accomplished by checking the temperature of each line while reading the file, and recording the day and the temperature whenever a lower temperature is found.
Once the loop is completed, it prints the values found:

```python
import re

def process():
    file = open('weather.dat')
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    day = 0
    temp = 1000
    for line in file.readlines():
        match = re.match(pattern, line.strip())
        if match:
            if int(match.group('min')) < temp:
                day = match.group('day')
                temp = int(match.group('min'))
    print("{} {}".format(day, temp))
    file.close()
```
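The same running-minimum search can also be written with the built-in `min` and a `key` function. This is only an alternative sketch, not the version used in the rest of the exercise, and the `rows` list is hypothetical sample data taken from a few days of the file:

```python
# (day, minimum temperature) pairs for a few days of weather.dat.
rows = [("8", 54), ("9", 32), ("10", 64), ("30", 45)]

# Explicit loop, as in process(): keep the lowest temperature seen so far.
day, temp = 0, 1000
for candidate_day, candidate_temp in rows:
    if candidate_temp < temp:
        day, temp = candidate_day, candidate_temp

# Equivalent one-liner: min() compares the rows by their temperature.
best_day, best_temp = min(rows, key=lambda row: row[1])
```

Both forms keep the first row when two days tie, since the loop uses a strict `<` and `min` returns the first minimal item.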

In [25]:
%%writefile weather.py
import re

def process():
    file = open('weather.dat')
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    day = 0
    temp = 1000
    for line in file.readlines():
        match = re.match(pattern, line.strip())
        if match:
            if int(match.group('min')) < temp:
                day = match.group('day')
                temp = int(match.group('min'))
    print("{} {}".format(day, temp))
    file.close()

Overwriting weather.py


Now the test should pass:

In [26]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



## Refactoring file reading

In the last iteration, now that we have a working solution, we'll refactor the code to make it more idiomatic and efficient, without modifying or breaking the test:

```python
import re

def process():
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    day = 0
    temp = 1000
    with open('weather.dat') as file:
        for line in file.readlines():
            match = re.match(pattern, line.strip())
            if match:
                if int(match.group('min')) < temp:
                    day = match.group('day')
                    temp = int(match.group('min'))
    print("{} {}".format(day, temp))
```

The first refactoring is purely idiomatic.
It only changes how we open the file, making it unnecessary to close it explicitly.
With the `with open('weather.dat') as file:` construct, Python automatically closes the file at the end of the block.
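That the file really is closed when the block ends can be verified through the file object's `closed` attribute. The sketch below writes a temporary file as a stand-in for `weather.dat`, an assumption made only to keep the example self-contained:

```python
import os
import tempfile

# Temporary stand-in for weather.dat so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("   9  86    32*\n")
    path = tmp.name

with open(path) as file:
    first_line = file.readline()
    open_inside = not file.closed   # still open inside the block

closed_after = file.closed          # closed once the block ends
os.remove(path)
```

The file is also closed if an exception is raised inside the block, which the explicit `file.close()` version did not guarantee.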

In [27]:
%%writefile weather.py
import re

def process():
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    day = 0
    temp = 1000
    with open('weather.dat') as file:
        for line in file.readlines():
            match = re.match(pattern, line.strip())
            if match:
                if int(match.group('min')) < temp:
                    day = match.group('day')
                    temp = int(match.group('min'))
    print("{} {}".format(day, temp))

Overwriting weather.py


Once this refactor is done, we need to ensure it is still passing the test:

In [28]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



## Refactoring data collection

As for how we read the lines, the `readlines` method is not the best choice, since it loads every line of the file into a list before the loop starts, which wastes memory and can be slower for large files.
The better approach for this kind of loop is to iterate directly over the file object:

```python
import re

def process():
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    day = 0
    temp = 1000
    with open('weather.dat') as file:
        for line in file:
            match = re.match(pattern, line.strip())
            if match:
                if int(match.group('min')) < temp:
                    day = match.group('day')
                    temp = int(match.group('min'))
    print("{} {}".format(day, temp))
```
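The difference between the two ways of reading can be seen in a small sketch. It uses `io.StringIO`, which supports the same line-oriented reading methods as a text file, as an in-memory stand-in for the data file:

```python
import io

data = "line 1\nline 2\nline 3\n"

# readlines() builds the complete list of lines up front.
all_lines = io.StringIO(data).readlines()

# Iterating the file object yields one line at a time instead.
stream = io.StringIO(data)
first = next(stream)
remaining = [line for line in stream]
```

For a 30-line file the saving is negligible, so this refactoring is more about adopting the idiomatic form than about a measurable gain.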

In [29]:
%%writefile weather.py
import re

def process():
    pattern = r"(?P<day>[0-9]+)\s+[0-9]+\s+(?P<min>[0-9]+).*"
    day = 0
    temp = 1000
    with open('weather.dat') as file:
        for line in file:
            match = re.match(pattern, line.strip())
            if match:
                if int(match.group('min')) < temp:
                    day = match.group('day')
                    temp = int(match.group('min'))
    print("{} {}".format(day, temp))

Overwriting weather.py


Which still passes the test:

In [30]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 1 items

test_weather.py .



## Refactoring data parsing

Another possible improvement is to replace `re`, which we're using to capture columns from space-separated rows, with a plain `split`.
The gain should be fairly clear, since `split` removes all regular expression evaluation from the processing, but it is still interesting to take a look at the processing times with `pytest-benchmark`.
So we'll need to `pip install pytest-benchmark`, and add a new test function:

```python
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out == "9 32\n"

def test_benchmark_process_weather(benchmark):
    benchmark(weather.process)
```

In case you're wondering why the benchmark lives in a separate test function, the reason is that `benchmark` runs the function more than once, so the captured output would no longer be a single line.
Another way to do that would be to make the test function return another function decorated with benchmark, but this would add an unnecessary function call.
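The effect of repeated runs on the captured output can be reproduced with the standard library alone. In this sketch `timeit` plays the role of the benchmark runner and `fake_process` is a hypothetical stand-in for `weather.process`; it is not meant to mirror how `pytest-benchmark` works internally:

```python
import io
import timeit
from contextlib import redirect_stdout


def fake_process():
    # Hypothetical stand-in for weather.process().
    print("9 32")


buffer = io.StringIO()
with redirect_stdout(buffer):
    # Like a benchmark, timeit calls the function repeatedly.
    elapsed = timeit.timeit(fake_process, number=3)

# One printed line per call ends up in the captured output.
output_lines = buffer.getvalue().splitlines()
```

With three calls the output holds three lines, so an assertion such as `out == "9 32\n"` could not hold inside a benchmarked test.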

In [31]:
%%writefile test_weather.py
import weather

def test_process_weather(capsys):
    weather.process()
    out, err = capsys.readouterr()
    assert out == "9 32\n"

def test_benchmark_process_weather(benchmark):
    benchmark(weather.process)

Overwriting test_weather.py


Both tests pass, and the output now includes the benchmark statistics:

In [32]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 2 items

test_weather.py ..


Computing stats ...Computing stats ... group 1/1Computing stats ... group 1/1: minComputing stats ... group 1/1: min (1/1)Computing stats ... group 1/1: min (1/1)Computing stats ... group 1/1: maxComputing stats ... group 1/1: max (1/1)Computing stats ... group 1/1: max (1/1)Computing stats ... group 1/1: meanComputing stats ... group 1/1: mean (1/1)Computing stats ... group 1/1: mean (1/1)Computing stats ... group 1/1: medianComputing stats ... group 1/1: median (1/1)Computing stats ... group 1/1: median (1/1)Computing stats ... group 1/1: iqrComputing stats ... group 1/1: iqr (1/1)Com

The output of the tests is slightly different now, since it includes a performance report table for the benchmarked function.
Most of the statistics shown are self-explanatory, but two of them, `IQR` and `Outliers`, are especially useful for comparing versions of the code that differ only slightly.
When the function is so fast and the improvements so small that the usual statistics, like `Min` and `Max`, are not conclusive indicators of improvement, `IQR` and `Outliers` usually signal such improvements much better.
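To make these statistics more concrete, the sketch below computes them by hand with the standard `statistics` module. The `timings` list is invented for illustration, `statistics.quantiles` needs Python 3.8 or later, and the quartile method used here is an assumption that may differ from the one `pytest-benchmark` applies:

```python
import statistics

# Hypothetical timings in microseconds for several benchmark rounds.
timings = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 40.0]

# Quartiles: the three cut points dividing the data into four parts.
q1, median, q3 = statistics.quantiles(timings, n=4, method="inclusive")
iqr = q3 - q1

mean = statistics.mean(timings)
stddev = statistics.stdev(timings)

# Rounds further than 1.5 * IQR from the quartiles count as outliers here.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [t for t in timings if t < low or t > high]
```

A single slow round of 40 pulls the mean above the median and inflates the standard deviation, while the IQR only reflects the spread of the middle half of the rounds.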

So the improvement in the code would be to use `split` on each data line to get the different columns:

```python
def process():
    day = 0
    temp = 1000
    with open('weather.dat') as file:
        for line in file:
            columns = line.replace("*", "").split()
            try:
                if len(columns) > 0 and int(columns[2]) < temp:
                    day = columns[0]
                    temp = int(columns[2])
            except ValueError:
                pass
    print("{} {}".format(day, temp))
```

This change also implies:
* Some numbers in the columns are marked with an `*`. The `re` version left it out of the captured group, but this new version needs to remove it explicitly, which can be accomplished by replacing `*` with an empty string.
* The previous `if match` check needs to be replaced by another criterion, which checks that `columns` is not an empty list.
* The group accessors on the match object now need to be replaced by list indexing on the corresponding columns.
* The `ValueError` exception needs to be caught for lines with non-integer columns, like the header and the footer.
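Each of these points can be observed on individual lines. The sketch below wraps the per-line logic in a hypothetical `parse` helper, which is not part of `weather.py`, and feeds it lines abridged from the file:

```python
def parse(line):
    """Return (day, minimum) for a data line, or None otherwise."""
    columns = line.replace("*", "").split()
    try:
        if len(columns) > 0:
            return columns[0], int(columns[2])
    except ValueError:
        pass
    return None


header = parse("  Dy MxT   MnT   AvT")    # int("MnT") raises ValueError
empty = parse("\n")                       # split() gives an empty list
day_9 = parse("   9  86    32*   59")     # "*" removed before int()
day_26 = parse("  26  97*   64    81")
footer = parse("  mo  82.9  60.5  71.7")  # int("60.5") raises ValueError
```

Note that a non-empty line with fewer than three columns would raise an `IndexError`, which this code does not catch; no such line exists in `weather.dat`.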

In [33]:
%%writefile weather.py
def process():
    day = 0
    temp = 1000
    with open('weather.dat') as file:
        for line in file:
            columns = line.replace("*", "").split()
            try:
                if len(columns) > 0 and int(columns[2]) < temp:
                    day = columns[0]
                    temp = int(columns[2])
            except ValueError:
                pass
    print("{} {}".format(day, temp))

Overwriting weather.py


In [34]:
%%bash
py.test test_weather.py

platform darwin -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
benchmark: 3.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=5.00us max_time=1.00s calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ifosch/src/github.com/BCNDojos/pyDojos/factor-out, inifile: 
plugins: benchmark-3.0.0
collected 2 items

test_weather.py ..


Computing stats ...Computing stats ... group 1/1Computing stats ... group 1/1: minComputing stats ... group 1/1: min (1/1)Computing stats ... group 1/1: min (1/1)Computing stats ... group 1/1: maxComputing stats ... group 1/1: max (1/1)Computing stats ... group 1/1: max (1/1)Computing stats ... group 1/1: meanComputing stats ... group 1/1: mean (1/1)Computing stats ... group 1/1: mean (1/1)Computing stats ... group 1/1: medianComputing stats ... group 1/1: median (1/1)Computing stats ... group 1/1: median (1/1)Computing stats ... group 1/1: iqrComputing stats ... group 1/1: iqr (1/1)Com

So for many of the standard statistics this version is not better, and it could even be considered worse, but the `IQR` metric is much better, meaning that the difference among the longer-running execution rounds is much smaller.
Also, `StdDev` and `Outliers` show that the rounds deviating more than 1 standard deviation from the mean are far fewer, and faster, than in the previous run.
These results should point to a performance improvement, though a slight one.