This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

Remove spark dependency #102

Merged

Conversation

Vyacheslav-Smirnov
Contributor

Spark is used to generate only one data frame for test execution.
Getting Spark to work on Windows is problematic.
So remove the Spark dependency from HPAT and add the generated
sdf_dt.pq data frame to the HPAT repo instead.

Vyacheslav-Smirnov and others added 2 commits July 31, 2019 16:19
Merge changes from origin repo
Spark is used to generate only one data frame for test execution.
Getting Spark to work on Windows is problematic.
So remove the Spark dependency from HPAT and add the generated
sdf_dt.pq data frame to the HPAT repo instead.
@shssf
Contributor

shssf commented Jul 31, 2019

  1. Do not put binary data in git.
  2. I see that the algorithm has changed here. Previously the DF was generated through interfaces, but the new version just unpacks binary data from an archive.
    I don't think this is the correct approach.

Contributor

@shssf shssf left a comment


I think the algorithm should remain essentially the same (create the data instead of unpacking it).

@fschlimb
Contributor

I had suggested using a small archive with input data rather than creating a dependency on spark. The archive is really small so not a big deal. We didn't see a good way to quickly generate spark-data without spark. Suggestions are welcome.

What is wrong with pre-generated input data (except maybe that larger binaries are problematic in git if they change frequently)?

@shssf
Contributor

shssf commented Jul 31, 2019

@fschlimb How many tests do we have that depend on this Spark data? I mean, maybe it would be better to put them into a separate "module/directory" and run them only if the "spark requirements" are satisfied (and generate data for them "on the fly" instead of using a "fixed file with data")?

I would propose avoiding pre-generated binary data because in the long term it might bring more pain than an algorithm to generate it.

At least, please put that archive into a subfolder. For example, "data" or "static data".
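
The idea of running Spark-dependent tests only when Spark is available could be sketched roughly as follows. This is an illustration only: `requires_module` and `TestSparkGeneratedData` are hypothetical names, and "spark requirements satisfied" is approximated here by `pyspark` being importable.

```python
import importlib.util
import unittest


def requires_module(name):
    """Skip the decorated test or test class unless `name` is importable."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"{name} is not installed")


# Hypothetical test class: skipped as a whole when pyspark is missing.
@requires_module("pyspark")
class TestSparkGeneratedData(unittest.TestCase):
    def test_generate_frame(self):
        # Import inside the test so that loading this module never needs Spark.
        import pyspark  # noqa: F401
        # ... generate the data frame on the fly and run the checks here ...
```

With this shape the tests are reported as skipped, not failed, on machines without Spark (for example the Windows setup mentioned in the description).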

Regarding "binary in git":
The problems begin when git needs to generate diffs and merges: git cannot generate meaningful diffs, or merge binary files in any way that could make sense. So all merges, rebases or cherry-picks involving a change to a binary file will require a manual conflict resolution on that binary file.
Also, you need to decide whether the binary file changes are rare enough that you can live with the extra manual work they cause in the normal git workflow involving merges, rebases and cherry-picks.

@fschlimb
Contributor

@shssf Yes, I agree with all you said. I only think that in this particular case the risk of causing issues is very low and not worth skipping a test. Either is fine with me, though.

Contributor

@fschlimb fschlimb left a comment


Should @shssf agree to the archive approach, please add a comment explaining why we're doing this.

@fschlimb fschlimb merged commit dd440cf into IntelPython:master Aug 1, 2019
Vyacheslav-Smirnov added a commit to Vyacheslav-Smirnov/sdc that referenced this pull request Aug 7, 2019
Remove spark dependency from HPAT; use pre-generated sdf_dt.pq
Vyacheslav-Smirnov added a commit to Vyacheslav-Smirnov/sdc that referenced this pull request Aug 7, 2019
Vyacheslav-Smirnov added a commit to Vyacheslav-Smirnov/sdc that referenced this pull request Aug 7, 2019
shssf pushed a commit that referenced this pull request Aug 8, 2019
* Remove spark dependency (#102)

Remove spark dependency from HPAT; use pre-generated sdf_dt.pq

* explicitly adding data-file (#104)

* HPAT Build: Code style check for C and Python sources (#103)

* HPAT Build: Code style check for C and Python sources

* PR103. Comments partially addressed

* Code style change part 1 (#106)

* Style check config for pystyle (#105)

* Fix for pandas.merge wrong overload handling of 'on' args (#99)

Problem description: merge_overload and merge_asof_overload functions
use 'on' argument value to compute 'left_on' and 'right_on' arguments
in a way that breaks type stability, causing compilation failure
when 'on' is assigned a StringLiteral value.

Error:
  File "../hpat/hiframes/dataframe_pass.py", line 202, in _run_assign
    return self._run_call(assign, lhs, rhs)
  File "../hpat/hiframes/dataframe_pass.py", line 522, in _run_call
    return self._run_call_join(assign, lhs, rhs)
  File "../hpat/hiframes/dataframe_pass.py", line 1488, in
_run_call_join
    left_on = self._get_const_or_list(left_on_var)
  File "../hpat/hiframes/dataframe_pass.py", line 2135, in
_get_const_or_list
    raise ValueError(err_msg)
ValueError: Failed in hpat mode pipeline (step: typed dataframe pass)
None

The following tests should be fixed with this commit:
    test_join_cat1 (hpat.tests.test_join.TestJoin)
    test_join_cat2 (hpat.tests.test_join.TestJoin)
    test_join_cat_parallel1 (hpat.tests.test_join.TestJoin)
    test_join_datetime_seq1 (hpat.tests.test_join.TestJoin)
    test_join_left_seq1 (hpat.tests.test_join.TestJoin)
    test_join_left_seq2 (hpat.tests.test_join.TestJoin)
    test_join_outer_seq1 (hpat.tests.test_join.TestJoin)
    test_join_right_seq1 (hpat.tests.test_join.TestJoin)
    test_merge_asof_seq1 (hpat.tests.test_join.TestJoin)

* [STL] PEP8 code style for 'test_strings.py', 'test_utils.py', 'test_series.py' (#97)

* pep8 style for 'test_strings.py'; flake8 check successful

* pep8 style for 'test_utils.py'

* pep8 style for 'test_series.py'; more readable

* fixed 'test_string_series'

* removed extra white spaces

* deleted mention of flake8

* trigger build

* Code style change part 2 (#107)

* code_style_change_part_2

* Add more check in style configuration (#108)

* code_style_part_3 (#109)

* Fix boost runtime issue on Ubuntu16.04 with gcc 5.4 (#92)

* Code style change part 4 (#110)

* Revert "Code style change part 4 (#110)"

This reverts commit dfc54ee.

* Revert "Fix boost runtime issue on Ubuntu16.04 with gcc 5.4 (#92)"

This reverts commit 231a76c.

* Revert "code_style_part_3 (#109)"

This reverts commit 4070ce3.

* Revert "Add more check in style configuration (#108)"

This reverts commit abf5bd0.

* Revert "Code style change part 2 (#107)"

This reverts commit 9076493.

* Revert "[STL] PEP8 code style for 'test_strings.py', 'test_utils.py', 'test_series.py' (#97)"

This reverts commit 8641f7a.

* Revert "Fix for pandas.merge wrong overload handling of 'on' args (#99)"

This reverts commit a2a8ee5.

* Revert "Style check config for pystyle (#105)"

This reverts commit 551c0e3.

* Revert "Code style change part 1 (#106)"

This reverts commit 6dae0b3.

* Revert "HPAT Build: Code style check for C and Python sources (#103)"

This reverts commit 1a30e4f.

* Revert "explicitly adding data-file (#104)"

This reverts commit 34a2260.

* Revert "Remove spark dependency (#102)"

This reverts commit 9e77fde.

* Unskip passing dataframe tests

The following tests in test_dataframe.py actually pass and can be unskipped:
test_create1
test_len1
test_column_getitem1
test_df_apply
test_df_apply_branch
test_df_describe
test_count1
test_append1

At the same time, test_sort_parallel and test_sort_parallel_single_col
have some problems with __pycache__:
 - They pass if the suite is executed with -B
 - They pass if they are executed separately
 - They fail when some of the tests above are unskipped.
So they are skipped for now.
shssf pushed a commit that referenced this pull request Aug 13, 2019
* Remove spark dependency (#102)

Remove spark dependency from HPAT; use pre-generated sdf_dt.pq

* explicitly adding data-file (#104)

* HPAT Build: Code style check for C and Python sources (#103)

* HPAT Build: Code style check for C and Python sources

* PR103. Comments partially addressed

* Code style change part 1 (#106)

* Style check config for pystyle (#105)

* Fix for pandas.merge wrong overload handling of 'on' args (#99)

Problem description: merge_overload and merge_asof_overload functions
use 'on' argument value to compute 'left_on' and 'right_on' arguments
in a way that breaks type stability, causing compilation failure
when 'on' is assigned a StringLiteral value.

Error:
  File "../hpat/hiframes/dataframe_pass.py", line 202, in _run_assign
    return self._run_call(assign, lhs, rhs)
  File "../hpat/hiframes/dataframe_pass.py", line 522, in _run_call
    return self._run_call_join(assign, lhs, rhs)
  File "../hpat/hiframes/dataframe_pass.py", line 1488, in
_run_call_join
    left_on = self._get_const_or_list(left_on_var)
  File "../hpat/hiframes/dataframe_pass.py", line 2135, in
_get_const_or_list
    raise ValueError(err_msg)
ValueError: Failed in hpat mode pipeline (step: typed dataframe pass)
None

The following tests should be fixed with this commit:
    test_join_cat1 (hpat.tests.test_join.TestJoin)
    test_join_cat2 (hpat.tests.test_join.TestJoin)
    test_join_cat_parallel1 (hpat.tests.test_join.TestJoin)
    test_join_datetime_seq1 (hpat.tests.test_join.TestJoin)
    test_join_left_seq1 (hpat.tests.test_join.TestJoin)
    test_join_left_seq2 (hpat.tests.test_join.TestJoin)
    test_join_outer_seq1 (hpat.tests.test_join.TestJoin)
    test_join_right_seq1 (hpat.tests.test_join.TestJoin)
    test_merge_asof_seq1 (hpat.tests.test_join.TestJoin)

* [STL] PEP8 code style for 'test_strings.py', 'test_utils.py', 'test_series.py' (#97)

* pep8 style for 'test_strings.py'; flake8 check successful

* pep8 style for 'test_utils.py'

* pep8 style for 'test_series.py'; more readable

* fixed 'test_string_series'

* removed extra white spaces

* deleted mention of flake8

* trigger build

* Code style change part 2 (#107)

* code_style_change_part_2

* Add more check in style configuration (#108)

* code_style_part_3 (#109)

* Fix boost runtime issue on Ubuntu16.04 with gcc 5.4 (#92)

* Code style change part 4 (#110)

* Change tests execution

The test suite should actually be executed via hpat.runtests:
python -u -m hpat.runtests -v
This resolves the issue with double test suite execution,
which occurs because the "python -u -m unittest -v" command
imports all files in the tree, including runtests.py, and runtests.py
triggers the first suite execution. Then unittest triggers the second.
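
The double execution described above is the usual effect of a module that runs tests as a side effect of being imported. A minimal sketch of one common way to avoid it, assuming a hypothetical runner module (`TestExample` and `run_suite` are illustrative names, not the actual contents of runtests.py):

```python
import unittest


class TestExample(unittest.TestCase):  # hypothetical stand-in for the real tests
    def test_truth(self):
        self.assertTrue(True)


def run_suite():
    """Build and run the suite once; returns the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExample)
    return unittest.TextTestRunner(verbosity=2).run(suite)


# Without this guard, merely importing the module (as unittest discovery does)
# would already run the suite. With it, the suite runs only when the module is
# executed directly, e.g. via `python -m <module>`.
if __name__ == "__main__":
    run_suite()
```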

Add a decorator to execute some tests (mostly parallel ones)
2 or more times (depending on the existence of the REPEAT_TEST_NUMBER
environment variable).
This should highlight issues such as a test failing when executed twice
because it corrupts memory during the first execution
(like test_series_head_index_parallel1).
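
A repeat decorator of the kind described here might look like the sketch below. The name `repeat_test` and the default of 2 repetitions are assumptions made for illustration; only the REPEAT_TEST_NUMBER variable comes from the commit message.

```python
import functools
import os


def repeat_test(func):
    """Run the decorated test REPEAT_TEST_NUMBER times (assumed default: 2)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Read the variable at call time so it can differ between runs.
        count = int(os.environ.get("REPEAT_TEST_NUMBER", "2"))
        for _ in range(count):
            func(*args, **kwargs)

    return wrapper
```

Applied to a test method, a failure or crash in any repetition fails the test, which is what exposes state left behind by the first execution.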

Skip test_series_head_index_parallel1 because it triggers memory
corruption. This should be fixed.

* Revert "Code style change part 4 (#110)"

This reverts commit dfc54ee.

* Revert "Fix boost runtime issue on Ubuntu16.04 with gcc 5.4 (#92)"

This reverts commit 231a76c.

* Revert "code_style_part_3 (#109)"

This reverts commit 4070ce3.

* Revert "Add more check in style configuration (#108)"

This reverts commit abf5bd0.

* Revert "Code style change part 2 (#107)"

This reverts commit 9076493.

* Revert "[STL] PEP8 code style for 'test_strings.py', 'test_utils.py', 'test_series.py' (#97)"

This reverts commit 8641f7a.

* Revert "Fix for pandas.merge wrong overload handling of 'on' args (#99)"

This reverts commit a2a8ee5.

* Revert "Style check config for pystyle (#105)"

This reverts commit 551c0e3.

* Revert "Code style change part 1 (#106)"

This reverts commit 6dae0b3.

* Revert "HPAT Build: Code style check for C and Python sources (#103)"

This reverts commit 1a30e4f.

* Revert "explicitly adding data-file (#104)"

This reverts commit 34a2260.

* Revert "Remove spark dependency (#102)"

This reverts commit 9e77fde.

* Wrap functions to be executed twice in runtests.py

* Update runtests.py

Execute every test a specified number of times, which is set via
the REPEAT_TEST_NUMBER environment variable.

Skip test_series_list_str_unbox1 because it fails on the second
launch with a segmentation fault.

* Apply comments from review

Rename REPEAT_TEST_NUMBER to HPAT_REPEAT_TEST_NUMBER
Use os.getenv to get value for HPAT_REPEAT_TEST_NUMBER
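
Repeating every test from the runner side, with the count read through os.getenv, could be sketched as below. The helper names (`repeat_count`, `iter_tests`, `repeated_suite`) and the fallback value of 1 are assumptions; the actual runtests.py may be organized differently.

```python
import os
import unittest


def repeat_count(default=1):
    """Read HPAT_REPEAT_TEST_NUMBER via os.getenv, falling back to `default`."""
    value = os.getenv("HPAT_REPEAT_TEST_NUMBER", str(default))
    try:
        return max(1, int(value))
    except ValueError:
        # A non-numeric value is ignored in this sketch.
        return default


def iter_tests(suite):
    """Yield the individual test cases nested anywhere inside `suite`."""
    for item in suite:
        if isinstance(item, unittest.TestSuite):
            yield from iter_tests(item)
        else:
            yield item


def repeated_suite(suite, count):
    """Return a new suite containing every test of `suite` `count` times."""
    tests = list(iter_tests(suite))
    return unittest.TestSuite(tests * count)
```

A runner would then pass `repeated_suite(loaded_suite, repeat_count())` to its test runner in place of the original suite.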
kozlov-alexey pushed a commit to kozlov-alexey/sdc that referenced this pull request Oct 4, 2019
Remove spark dependency from HPAT; use pre-generated sdf_dt.pq
kozlov-alexey pushed a commit to kozlov-alexey/sdc that referenced this pull request Oct 4, 2019
kozlov-alexey pushed a commit to kozlov-alexey/sdc that referenced this pull request Oct 4, 2019
@Vyacheslav-Smirnov Vyacheslav-Smirnov deleted the feature/remove_spark branch February 20, 2020 10:54

3 participants