Re write the result set #1165

Merged: 11 commits into possible-v4 on Feb 3, 2018
Conversation

@drvinceknight (Member) commented Jan 29, 2018

Note this PR is to a possible-v4 branch, which is a branch of master.

(@marcharper @meatballs Genuinely sorry for the diff on result_set.py: it's a mess, as this PR essentially guts that file... If we like the idea of what I've done here I'm happy to fix that as best I can.)

At present the result set plays the matches in parallel followed by a
(sometimes computationally expensive) single process of reading in and
analysing the interactions.

TLDR: this changes how the internal result set calculations are done,
making them more efficient.

This happens by doing the following:

1. The various match by match calculations are done by the tournament
   (the winner of a match, the cooperation count, the score etc...),
   mainly thanks to `Tournament._calculate_results`.
2. These calculations are written to file (at present we just write the
   actual actions of an interaction) in
   `Tournament._write_interaction_to_file`.
3. The form of this file has also changed: for every match there are 2
   rows, one row per player. This might seem more costly storage wise
   but is done to enable faster analysis (more on that later).
4. The analysis is now done entirely using `dask`: "Dask is a flexible
   parallel computing library for analytic computing."
   (https://dask.pydata.org/). This ensures all calculations are done on
   disk (so no huge memory problems) but also that they can be done **in
   parallel**. This is all done using the nice Pandas-like API that
   `dask` has, so essentially all calculations for the result set are
   just done by a few `groupby` statements (see the sketch after this
   list). The tasks to be run are built with `ResultSet._build_tasks`
   and then computed (in parallel) with `ResultSet._compute_tasks`.
5. There is *some* work being done outside of `dask`, but that's just
   reshaping data: `dask` outputs `pandas.Series` and, to be consistent
   with our current setup, these are converted to lists of lists etc...
   This is all done in `ResultSet._reshape_out`.
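
To make item 4 concrete, here is a minimal sketch of the kind of lazy
`groupby` analysis this enables. The file name and the two aggregations
are illustrative (the real task list is built in
`ResultSet._build_tasks`); the column names are the ones that appear in
the diffs further down.

```python
import dask
import dask.dataframe as dd

# Lazy read: the data file (two rows per match, one per player) stays
# on disk rather than being loaded into memory.
df = dd.read_csv("interactions.csv")  # illustrative file name

# Build task graphs; nothing is computed yet.
mean_score_task = df.groupby("Player index")["Score"].mean()
win_count_task = df.groupby("Player index")["Win"].sum()

# Compute both tasks together, in parallel.
mean_scores, win_counts = dask.compute(mean_score_task, win_count_task)
```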

**Some user facing changes** (which is why I suggest this would be for a
`v4.0.0` release):

- `result_set.interactions` is no longer available. This is a choice
  and not forced by the redesign: I don't think it is ever necessary or
  helpful. The data file can always be read in.
- The ability to `tournament.play(in_memory=True)` has been removed.
  Again, not entirely a forced change (although it would be a tiny bit
  of work to allow this). Given the recent work to make everything work
  on Windows I don't think this is necessary, and removing it has
  allowed for a big deletion of code (which is a good thing re
  maintenance).
- The interactions data file is now in a different format; this is
  forced by the design choice.
- I have made a slight modification to `result_set.cooperation`.
  Currently this sets all self interactions to be 0, but I think that's
  not the right way to display it (note that the cooperation rates were
  all being calculated correctly).

**As well as ensuring the work done in series is reduced and that the
parallel workers also calculate the scores (which I think is more
efficient)**, this also seems to be faster:

On this branch:

```python
import axelrod as axl
players = [s() for s in axl.strategies if s.classifier["memory_depth"] == 1]
tournament = axl.Tournament(players, turns=200, repetitions=100)
results = tournament.play(processes=4)
```

Took: 1min 49s

```python
import axelrod as axl
players = [s() for s in axl.short_run_time_strategies]
tournament = axl.Tournament(players, turns=200, repetitions=20)
results = tournament.play(processes=4)
```

Took: 21min 2s

On current master:

```python
import axelrod as axl
players = [s() for s in axl.strategies if s.classifier["memory_depth"] == 1]
tournament = axl.Tournament(players, turns=200, repetitions=100)
results = tournament.play(processes=4)
```

Took: 2min 36s

```python
import axelrod as axl
players = [s() for s in axl.short_run_time_strategies]
tournament = axl.Tournament(players, turns=200, repetitions=20)
results = tournament.play(processes=4)
```

Took: 28min 8s

**One final aspect to consider** (I think) is that adding `dask` as a
dependency opens up the potential to use its `delayed` decorator to do
all our parallel processing. This would have the benefit of being able
to use the distributed scheduler that `dask` has. (We'd have to
investigate whether this actually works with our parallelisation, but at
least the possibility is there.)
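
For illustration, a sketch of how `dask.delayed` could wrap the
per-match work; `play_match` and the index pairs here are hypothetical
stand-ins, not the library's or this PR's actual API.

```python
import dask

@dask.delayed
def play_match(pair):
    # Hypothetical stand-in for playing one match and returning its
    # summary rows; the real per-match work lives in the tournament code.
    player_index, opponent_index = pair
    return (player_index, opponent_index)

match_pairs = [(0, 1), (0, 2), (1, 2)]  # illustrative player index pairs
rows = dask.compute(*(play_match(pair) for pair in match_pairs))
```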

@marcharper (Member) left a comment

Will review more -- github isn't loading the large diffs and I don't want to lose the comments I've already made.

Do we want to add an end user function to assist in reading the interactions file?

```diff
@@ -239,36 +240,27 @@ def compute_sparklines(interactions, c_symbol='█', d_symbol=' '):
             sparkline(histories[1], c_symbol, d_symbol))


-def read_interactions_from_file(filename, progress_bar=True,
-                                num_interactions=False):
+def read_interactions_from_file(filename,
```
@marcharper (Member):

one line?

@drvinceknight (Member Author):

> Will review more -- github isn't loading the large diffs and I don't want to lose the comments I've already made.

Yeah, no rush with this at all. Small piecemeal chunks is probably the way to go. Sorry the diff is so horrible.

> Do we want to add an end user function to assist in reading the interactions file?

We've got `iu.read_interactions_from_file`, which I've documented in
843b008. I think that could be user facing (and in fact we could have another tutorial going over each function in `iu`?), but let me know if you were thinking of something different.
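
For reference, usage would look something like this; the file name is a
placeholder for whatever data file the tournament wrote.

```python
import axelrod.interaction_utils as iu

# Placeholder file name: read back the interactions data file that was
# written while the tournament played.
interactions = iu.read_interactions_from_file("interactions.csv")
```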

```python
for _, d in tqdm.tqdm(groupby):
    key = tuple(d[["Player index", "Opponent index"]].iloc[0])
    value = list(map(str_to_actions, zip(*d["Actions"])))
    try:
```
@marcharper (Member):

Could use a defaultdict(list) here.

@drvinceknight (Member Author):

Good call, much nicer 👍
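
The shape of that change, with illustrative data and a hypothetical
variable name:

```python
from collections import defaultdict

# With a defaultdict(list) the try/except around the first append for a
# given (player index, opponent index) key disappears.
pairs_to_actions = defaultdict(list)
for key, value in [((0, 1), "CDC"), ((0, 1), "CCD")]:  # illustrative data
    pairs_to_actions[key].append(value)
```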

This highlights it as a user facing function but I'm happy to do
something further.
Commits pushed: "Hopefully this now works on windows." and "Attempting to make file comparison work on windows."
@marcharper (Member) left a comment

In principle everything looks ok. There were several places where the spacing seems off / isn't the most readable choice IMO.

```python
else:
    cooperation_rates[opponent_index] = [did_c(player_history)]

df = dd.read_csv(filename)
df = df[df["Player index"] == 0][["Opponent index", "Actions"]]
```
@marcharper (Member):

A comment here may be useful in the future, particularly why the need for the filter for player index.


```diff
-from numpy import mean, nanmedian, std
+from numpy import mean, nanmedian, std, array, nan_to_num
```
@marcharper (Member):

minor: I'd prefer to `import numpy as np` and then use `np.mean` and such, because names like `mean`, `std`, `array` are common and there could be collisions.

@drvinceknight (Member Author):

Yeah I agree, I'll change this.
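
For example (illustrative values):

```python
import numpy as np

# Namespaced calls cannot collide with local names like mean or array.
scores = np.array([1.0, np.nan, 3.0])
print(np.mean(np.nan_to_num(scores)))  # nan is replaced by 0.0 first
```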


```python
del self.repetitions_d  # Manual garbage collection
return repetitions

(mean_per_reps_player_opponent_df,
```
@marcharper (Member):

Is there a reason not to pass these in explicitly as parameters? Or to use *args instead?

@drvinceknight (Member Author):

Yeah that's way better 👍


```python
[pi1, pi2, pi3, ..., pim]

@update_progress_bar
def _build_match_lengths(self, length_series):
```
@marcharper (Member):

Seems like these functions could be abstracted, they are all very similar.

@drvinceknight (Member Author):

I thought that and started taking a look at it but they've all got tiny differences that made it a bit painful. I'll take another look :) 👍

```python
for opponent_index in range(self.num_players):
    count = cooperation_dict.get((player_index, opponent_index), 0)
    if player_index == opponent_index:
        # Address double count
```
@marcharper (Member):

Spacing

```python
else:
    payoff_stddevs[player][opponent] = 0

row.append(good_partner_dict.get((player_index,
```
@marcharper (Member):

minor: In a case like this I think wrapping the entire tuple is more readable:

```python
row.append(good_partner_dict.get(
    (player_index, opponent_index), 0))
```

```python
@update_progress_bar
def _build_initial_cooperation_count(self, initial_cooperation_count_series):
    initial_cooperation_count_dict = initial_cooperation_count_series.to_dict()
    initial_cooperation_count = [
```
@marcharper (Member):

Spacing on this one seems weird.

```python
columns = ["Win", "Score"]
sum_per_player_repetition_task = adf.groupby(groups)[columns].sum()

normalised_scores_task = adf.groupby(["Player index",
```
@marcharper (Member):

maybe break up into two lines or use different spacing

```python
keep_interactions=keep_interactions, in_memory=in_memory
)

result_set = ResultSet(
```
@marcharper (Member):

Spacing


```python
states = [(C, C), (C, D), (D, C), (D, D)]
for state in states:
    if index == 1:
```
@marcharper (Member):

pull to the outer context; consider using reverse() (modifies in place)
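
What that suggestion amounts to (the string pairs below are
illustrative stand-ins for the (C, D) action objects):

```python
# list.reverse() mutates in place and returns None, so it must be a
# statement rather than chained in an expression.
states = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]
states.reverse()
print(states)  # [('D', 'D'), ('D', 'C'), ('C', 'D'), ('C', 'C')]
```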

@drvinceknight (Member Author) commented Feb 1, 2018

> In principle everything looks ok.

Thanks for taking the time to look through.

> There were several places where the spacing seems off / isn't the most readable choice IMO.

I think I've got them all, but I'm not at all precious here: happy to change things further.

I believe I've caught all your other suggestions/requests; I've gotten rid of some of the methods, replacing them with abstractions. I believe the ones that are left are at about the right level of "reducible", and further abstraction would be unhelpful, but I'm happy to revisit.

@meatballs (Member):

All looks good to me. Nice piece of work!

@drvinceknight (Member Author):

> All looks good to me. Nice piece of work!

Great, thanks for taking the time @meatballs 👍

Once we're happy with this (@marcharper please don't hesitate to point out anything missing), we can merge and then open a PR from possible-v4 into master. Not sure why I felt the need for an intermediate branch...

@marcharper marcharper merged commit 43cfc53 into possible-v4 Feb 3, 2018
@marcharper marcharper deleted the rewrite-result-set branch August 17, 2018 02:35